Multi-Tenancy on Kubernetes
Problem
Client-specific workloads can be made to run on client-specific hardware. This can be achieved with Node Selectors or with Taints/Tolerations; both approaches have pros and cons, and this document lays them out.
Some clients have their own hardware on which their tasks, apps, and other long-running workloads run. The current Rancher master has reached EOL and we are planning to move to the current K8s platform. Because of better functionality and redundancy, resource requirements have increased, so we are looking for a better solution for the client clusters. The proposed solution is to merge those nodes into the multi-tenant (MT) K8s cluster while still keeping them segregated by VLANs plus a mechanism for workload separation.
The following methods can be used for workload segregation:
Node Selectors via Labels / Affinity : https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
Taints and Tolerations : https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
Solution
What is multi-tenant Kubernetes?
- Multi-tenant Kubernetes is a Kubernetes deployment where multiple applications or workloads run side-by-side.
- Multi-tenancy is a common architecture for organizations that have multiple applications running in the same environment, or where different teams (like developers and IT Ops) share the same Kubernetes environment.
- You can think of multi-tenant Kubernetes as being akin to an apartment building, whereas single-tenant Kubernetes is like a single-family house.
Option 1 : Taints & Tolerations
Description:
Taints are a Kubernetes feature that repels pods from nodes. When a node is tainted, all pods that do not have a matching toleration are repelled from that node.
This method works as DENY ALL UNLESS ALLOWED.
Example :
# Node Taint
>>> kubectl taint nodes test-node-1 my-taint=test:NoSchedule
>>> kubectl describe node test-node-1
Name:               test-node-1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2a
                    kubernetes.io/hostname=ip-192-168-101-21.us-west-2.compute.internal
                    pipeline-nodepool-name=pool1
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Wed, 29 Aug 2018 11:31:53 +0200
Taints:             my-taint=test:NoSchedule
# Pod toleration
spec:
  tolerations:
  - key: "my-taint"
    operator: "Equal"
    value: "test"
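Putting it together, a minimal complete pod manifest that is allowed onto the tainted node could look as follows (the pod name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: client-task            # placeholder name
spec:
  containers:
  - name: client-task
    image: nginx               # placeholder image
  tolerations:
  - key: "my-taint"
    operator: "Equal"
    value: "test"
    effect: "NoSchedule"       # matches the taint applied above
Note that a toleration only allows the pod onto the tainted node; it does not pin the pod there. To guarantee client workloads land on client nodes, the taint is typically combined with a node label and a node selector (Option 2).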
Pros:
- Pods from other orgs/namespaces are guaranteed not to schedule on nodes that are owned by a given client. Best security and workload isolation between clients.
- This requires the least amount of work on the software side; only client pods need changes to their pod configuration. (It is roughly the same amount of work either way.)
Cons:
- If client nodes fail, spilling client work over onto the MT cluster will be extremely difficult.
- Debugging a failed container by starting a test container requires a toleration, i.e. additional configuration on every such pod (lots of work; a sketch follows this list).
- Starting client-specific workloads via Helm requires additional configuration (also sketched below).
- Every master/system-level component needs a toleration added for the given taints.
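To illustrate the extra configuration the debugging and Helm points refer to, here is a rough sketch; the pod name, image, and chart values are placeholders, and whether a chart exposes a tolerations value depends on the chart:
# One-off debug pod on a tainted client node (toleration injected via --overrides)
>>> kubectl run debug-shell --rm -it --restart=Never --image=busybox \
      --overrides='{"apiVersion":"v1","spec":{"tolerations":[{"key":"my-taint","operator":"Equal","value":"test","effect":"NoSchedule"}]}}' \
      -- sh
# values.yaml fragment for a Helm chart that exposes a tolerations value (chart-dependent)
tolerations:
- key: "my-taint"
  operator: "Equal"
  value: "test"
  effect: "NoSchedule"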
Option 2 : Node Selectors via Labels / Affinity
Description:
With this approach, each pod starts only on nodes that carry certain labels.
Example :
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
Simpler Config:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
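For the nodeSelector above to be satisfiable, at least one node must carry the matching label; the node name below is a placeholder:
# Label a node so that pods requesting "disktype: ssd" can schedule onto it
>>> kubectl label nodes test-node-1 disktype=ssd
# Verify the label
>>> kubectl get nodes --show-labels | grep disktype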
Pros:
- Better fault tolerance and greater capacity for all clients. (By contrast, with taints, if a tainted node fails, the other tainted nodes could have trouble starting all of the failed pods, which would increase recovery time.)
- Moving workloads around during a failure is a lot easier.
- Starting debug workloads is easier.
- Easier and more detailed configuration of scheduling rules, including weight-based scheduling.
- Future support for the Apps framework.
Cons:
- With node selectors, a workload can end up on another client's node if it does not explicitly request MT nodes (mostly an issue for manually started debugging pods; this should be reduced as much as possible, and one mitigation is sketched below).
- Node selector/affinity rules can grow into extremely detailed configuration (hard to grasp).
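One way to reduce the first risk is to give every node a pool label and have every workload, including manually started debug pods, state its pool explicitly. A minimal sketch, with placeholder node names and label values:
# Label shared MT nodes and client-owned nodes with a pool label (names/values are placeholders)
>>> kubectl label nodes mt-node-1 pool=shared
>>> kubectl label nodes client-a-node-1 pool=client-a
# Every workload then pins itself to its pool via nodeSelector
spec:
  nodeSelector:
    pool: shared        # or pool: client-a for client-specific workloads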