K8s resource creation

Scheduling of resources in k8s and Static Pods

Whenever a new manifest file is applied through kubectl, it first goes to the Kubernetes API server in the control plane. The API server validates the request and stores the desired state in etcd. The scheduler then picks up the new, unscheduled resource. If the manifest does not say on which node the resource should be created, the scheduler decides this by performing its own checks on the available computing resources of each node, such as CPU and memory.
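
As a quick illustration, you can apply a minimal Pod without any node constraints and then check which node the scheduler picked (the Pod name and image below are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod        # placeholder name
spec:
  containers:
  - name: app
    image: nginx        # placeholder image, anything works here

kubectl apply -f demo-pod.yaml
kubectl get pod demo-pod -o wide   # the NODE column shows where the scheduler placed it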

Now, if we predefine in the manifest on which node the resource should be created, the kubelet on that node creates it directly, bypassing the scheduler. But if the scheduler is responsible for placing resources, then who schedules the scheduler Pod itself, and the other control plane components? These are called static Pods. In every Kubernetes cluster there is this

/etc/kubernetes/manifests/

path; if you place any Kubernetes manifest there, the kubelet creates the resource, because it always watches this directory. The kubelet itself is not a Pod at all (at least not initially). It is a system-level process (a binary) that runs directly on each node's operating system. You can check it with the

systemctl status kubelet

command.
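
On a kubeadm-provisioned control plane node, for example, this directory already holds the manifests of the control plane components themselves (exact file names can vary with your setup):

ls /etc/kubernetes/manifests/
# typically: etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml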

Now, if you put a Pod manifest in the /etc/kubernetes/manifests/ directory, the kubelet on that node creates it as a static Pod; the scheduler is never involved, and the Pod always runs on that node. For a regular resource applied through kubectl, if you don't specify anything about node placement, the scheduler will automatically place it on some node. If you want it to run on a node of your choice, put

nodeSelector:
  kubernetes.io/hostname: NODE_NAME

at the end of your YAML; the scheduler will then consider only that node. If you set spec.nodeName instead, the scheduler is bypassed entirely and the kubelet on that node creates the Pod directly, as in the sketch below.
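
A minimal sketch of a Pod pinned directly with nodeName (name, image and node name are placeholders): because the node is already decided in the spec, the scheduler is skipped and the kubelet on that node starts the Pod.

apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod        # placeholder name
spec:
  nodeName: NODE_NAME     # replace with the target node's name
  containers:
  - name: app
    image: nginx          # placeholder image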

Manual Scheduling strategies of K8s resources

Scheduling a certain type of Kubernetes resource on a specific node is crucial in large-scale deployments: some of your applications may need GPUs, while others need to run on nodes in a certain geographic region. In these cases, we need to control on which nodes our resources/workloads run.

To handle these scenarios, Kubernetes brings:

  1. Taint and Tolerations
  2. Node Labels & Selectors
  3. Node affinity

Taint and Tolerations

Here you taint a node with a key-value pair, and if a resource can tolerate that taint, the scheduler may place it on that node. If the resource can't tolerate it, it will be scheduled on some other node. Taints and tolerations only make a node reject certain kinds of resources; a tolerating resource can still end up on any other node. So basically, we can't control on which node the resource will be scheduled; we can only protect the node from unwanted resources.

A taint can be applied with one of 3 effects.

  1. NoSchedule This prevents new resources from being scheduled on the node but does not remove the running/existing ones.

  2. PreferNoSchedule This tries to do the same as NoSchedule, but only as a preference; it does not enforce it strictly.

  3. NoExecute This prevents new resources from being scheduled on the node and also evicts the running/existing ones.

To taint a node

kubectl taint node NODE-NAME KEY=VALUE:NoSchedule/PreferNoSchedule/NoExecute

example

kubectl taint node ai-workload-node GPU=True:NoExecute

Now this node only runs AI workloads that carry a matching GPU=True toleration.
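
You can verify the taint was applied with the command below; kubectl describe prints a Taints field for the node.

kubectl describe node ai-workload-node | grep -i taints
# expected to show: GPU=True:NoExecute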

Below is the syntax for adding a toleration to a resource manifest file. The operator can be Equal or Exists. Equal is the same as "=", while Exists only checks that the KEY is present on the node's taint and tolerates it regardless of the value; the node has to be tainted accordingly.

tolerations:
- key: "GPU"
  operator: "Equal"
  value: "True"
  effect: "NoExecute"
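
For context, here is how that toleration sits inside a full Pod manifest (Pod name and image are placeholders). Remember that a toleration only allows the Pod onto the tainted node; it does not force the Pod to land there.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload      # placeholder name
spec:
  containers:
  - name: app
    image: nginx          # placeholder image
  tolerations:
  - key: "GPU"
    operator: "Equal"
    value: "True"
    effect: "NoExecute"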

Node Labels & Selectors

This adds a label to a node, and resources carrying the matching selector will be scheduled on that node. This also gives control at the resource level, so the resource can now choose on which node it will be scheduled, which was missing with taints and tolerations.

Adding a label to a node

kubectl label node NODE-NAME KEY=VALUE

example

kubectl label node pre-production-node env=pre-prod

Syntax of configuring the resource with selector

nodeSelector:
  env: "pre-prod"

Now every node having the label "env=pre-prod" will accept this resource, and this resource will be scheduled only on those nodes. But nodeSelector only supports exact key=value equality matches, while Node Affinity gives more operators and customization.
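
You can list the nodes carrying the label, and this is what a complete Pod manifest with the selector looks like (Pod name and image are placeholders):

kubectl get nodes -l env=pre-prod

apiVersion: v1
kind: Pod
metadata:
  name: pre-prod-app      # placeholder name
spec:
  containers:
  - name: app
    image: nginx          # placeholder image
  nodeSelector:
    env: "pre-prod"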

Node Affinity

Here too the node is labelled with some KEY=VALUE pair, but the resource manifest gets much more flexibility in how the resource is scheduled. Node affinity offers 6 operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. The "Not" prefix of In and Exists gives node anti-affinity as well: it does exactly the inverse of affinity, so when the expression matches, the node is excluded. "requiredDuringSchedulingIgnoredDuringExecution" and "preferredDuringSchedulingIgnoredDuringExecution" are two further refinements. The first treats the rule as mandatory for scheduling, while the second treats it as a preference. Neither affects Pods that are already running, because of the "IgnoredDuringExecution" suffix.

Now suppose a node is labelled with "region=us-east" and "disktype=ssd", and the manifest is configured like below:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
          - key: region
            operator: In
            values:
            - us-east
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: environment
            operator: In
            values:
            - production

So, in this case "requiredDuringSchedulingIgnoredDuringExecution" acts as the mandatory part: the node must satisfy every matchExpression inside the nodeSelectorTerm (here both disktype and region), and the In operator matches when the node's label value is one of the listed values. If several nodeSelectorTerms are given, satisfying any one of them is enough.

If another node has both labels "region=us-east" and "disktype=ssd" along with "environment=production", the resource will be scheduled on that node, because it also satisfies the "preferredDuringSchedulingIgnoredDuringExecution" preference. The weight field holds a number between 1 and 100 and sets the priority of each preference; the scheduler adds up the weights of the preferences a node matches. For example, if another preference with a higher weight than the "environment=production" one matches a different node, the resource will be scheduled on that node instead.
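
A minimal sketch of how weights play out (the zone label key and values are assumptions for illustration): the scheduler adds up the weights of the preferences a node matches, so a node matching only the weight-80 term scores higher than a node matching only the weight-1 term, and both score lower than a node matching both.

      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: zone                # assumed label key
            operator: In
            values:
            - us-east-1a
      - weight: 1
        preference:
          matchExpressions:
          - key: environment
            operator: In
            values:
            - production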