Manual Scheduling Strategies in K8s

Manual scheduling strategies for K8s resources

Scheduling a certain type of K8s resource on a specific node is crucial in large-scale deployments: some of your applications may need GPUs, while others must run on nodes in a certain geographic region. In these cases, we need to control which node our resources/workloads run on.

To handle these scenarios, k8s brings:

  1. Taints and Tolerations
  2. Node Labels & Selectors
  3. Node Affinity

Taints and Tolerations

Here you taint a node with a key-value pair, and if a resource can tolerate that taint, the scheduler may place it on that node. If it can't, it will be scheduled on some other node. Note that taints and tolerations only make a node reject certain resources; a tolerating resource can still land on any other node. So we can't control where a resource will be scheduled, we can only protect a node from unwanted resources.

A taint can be applied with one of three effects:

  1. NoSchedule: prevents new resources from being scheduled on the node, but does not evict the running/existing ones.

  2. PreferNoSchedule: a softer version of NoSchedule; the scheduler tries to avoid the node but will not forcefully prevent scheduling there.

  3. NoExecute: prevents new resources from being scheduled on the node and also evicts the running/existing ones that don't tolerate it.

To taint a node

kubectl taint node NODE-NAME KEY=VALUE:NoSchedule/PreferNoSchedule/NoExecute

example

kubectl taint node ai-workload-node GPU=True:NoExecute
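
To verify the taint was applied, you can describe the node; the grep filter here is just one convenient way to pick out the Taints field:

kubectl describe node ai-workload-node | grep Taints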

Now this node only accepts workloads that carry a matching GPU=True toleration.

Below is the syntax for adding a toleration to a resource manifest file. The operator can be Equal or Exists. Equal works like "=" and requires the value to match, while Exists only checks that the key is present and tolerates the taint regardless of its value; the node's taint should be configured accordingly.

tolerations:
- key: "GPU"
  operator: "Equal"
  value: "True"
  effect: "NoExecute"
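
For context, here is a minimal Pod sketch carrying this toleration; the pod name and image are hypothetical placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod        # hypothetical name
spec:
  containers:
  - name: trainer
    image: ai-trainer:latest    # hypothetical image
  tolerations:
  - key: "GPU"
    operator: "Equal"
    value: "True"
    effect: "NoExecute"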

Node Labels & Selectors

Here you add a label to a node, and resources carrying a matching selector are scheduled on that node. This gives control at the resource level too, so the resource can now choose which node it runs on, something that was missing with taints and tolerations.

Adding a label to a node

kubectl label node NODE-NAME KEY=VALUE

example

kubectl label node pre-production-node env=pre-prod
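
To confirm the label, you can list the nodes that carry it:

kubectl get nodes -l env=pre-prod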

Syntax for configuring the resource with a selector

nodeSelector:
  env: "pre-prod"
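
Putting it together, a minimal Pod sketch using this selector; the pod name and image are hypothetical placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: pre-prod-app          # hypothetical name
spec:
  containers:
  - name: app
    image: my-app:latest      # hypothetical image
  nodeSelector:
    env: "pre-prod"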

Now any node carrying the label "env=pre-prod" will accept this resource, and the resource will be scheduled only on such nodes. But nodeSelector only supports simple equality matching on labels, while Node Affinity gives more operators and customization.

Node Affinity

Here, too, the node is labelled with a KEY=VALUE pair, but the resource manifest gets far more flexibility in how it selects nodes. Node affinity supports six operators: In, NotIn, Exists, DoesNotExist, Gt, and Lt. NotIn and DoesNotExist provide node anti-affinity features: they do the exact inverse, keeping the resource away from nodes that match. There are also two scheduling rules, " requiredDuringSchedulingIgnoredDuringExecution " and " preferredDuringSchedulingIgnoredDuringExecution ". The first makes the rule mandatory for scheduling, while the second treats it as a preference. Neither affects resources that are already running, as the "IgnoredDuringExecution" suffix indicates.

Now suppose a node has the labels " region=us-east " and " disktype=ssd " , and the manifest is configured like below:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
          - key: region
            operator: In
            values:
            - us-east
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: environment
            operator: In
            values:
            - production

So, in this case " requiredDuringSchedulingIgnoredDuringExecution " acts as the mandatory part: for a node to be eligible, it must satisfy every matchExpression in the term, and the In operator matches when the node's label value is any one of the listed values. Here the node must carry both disktype=ssd and region=us-east, so our labelled node qualifies and the resource will be scheduled there.

If another node carries both " region=us-east " and " disktype=ssd " along with " environment=production ", the resource will be scheduled on that node instead, because the " preferredDuringSchedulingIgnoredDuringExecution " preference also matches there. The weight field holds a number between 1 and 100 and sets the priority of each preference: if some other preference with a higher weight matches a different node, the resource will be scheduled on that node rather than this one.
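
As a small sketch of the anti-affinity behaviour mentioned earlier, the NotIn operator keeps a resource away from matching nodes; the region value here is a hypothetical example:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: region
            operator: NotIn     # anti-affinity: avoid nodes in these regions
            values:
            - us-west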