Using autoscaler functionality in Tanzu Kubernetes Grid 1.2.1

The 1.2.1 release of TKG does not come with many new features over the 1.2.0 release, but one very important addition is autoscaler functionality. Enabling and configuring it on your cluster is fairly straightforward, and I’ll walk through the process in this post.

My TKG 1.2.1 deployment is running on top of vSphere 7.0 U1 and starts out with a management cluster composed of a single control plane node and a single worker node (both small sized). The Kubernetes version in play is 1.19.3, but you could also run 1.18.10 or 1.17.13.

tkg get mc

  MANAGEMENT-CLUSTER-NAME  CONTEXT-NAME                   STATUS
  tkg-mgmt-as *            tkg-mgmt-as-admin@tkg-mgmt-as  Success
kubectl get nodes

 NAME                                STATUS   ROLES    AGE     VERSION
 tkg-mgmt-as-control-plane-z6t9v     Ready    master   6m15s   v1.19.3+vmware.1
 tkg-mgmt-as-md-0-79b5ff4b89-vhfcc   Ready    <none>   5m1s    v1.19.3+vmware.1

Autoscaler functionality can only be enabled on workload clusters, so I’ll be creating one next. First, though, I need to set a few environment variables so that the autoscaler is configured properly. You can examine the .tkg/providers/config_default.yaml file to see what the available autoscaler options are.

grep AUTOSCALER .tkg/providers/config_default.yaml

 ENABLE_AUTOSCALER: false
 AUTOSCALER_MAX_NODES_TOTAL: "0"
 AUTOSCALER_SCALE_DOWN_DELAY_AFTER_ADD: "10m"
 AUTOSCALER_SCALE_DOWN_DELAY_AFTER_DELETE: "10s"
 AUTOSCALER_SCALE_DOWN_DELAY_AFTER_FAILURE: "3m"
 AUTOSCALER_SCALE_DOWN_UNNEEDED_TIME: "10m"
 AUTOSCALER_MAX_NODE_PROVISION_TIME: "15m"
 AUTOSCALER_MIN_SIZE_0:
 AUTOSCALER_MAX_SIZE_0:
 AUTOSCALER_MIN_SIZE_1:
 AUTOSCALER_MAX_SIZE_1:
 AUTOSCALER_MIN_SIZE_2:
 AUTOSCALER_MAX_SIZE_2:

The ENABLE_AUTOSCALER option is self-explanatory, and I won’t be setting it here; instead, I’ll pass the --enable-cluster-options autoscaler flag to the tkg create cluster command. The following should help to explain the other options:

  • AUTOSCALER_MAX_NODES_TOTAL – The maximum number of Kubernetes worker nodes across the entire cluster. The default of 0 means no total-node limit is enforced and scaling is bounded only by the per-AZ MIN/MAX SIZE values.
  • AUTOSCALER_SCALE_DOWN_DELAY_AFTER_ADD – The amount of time that the system will wait after a scale up operation before resuming scale down evaluations.
  • AUTOSCALER_SCALE_DOWN_DELAY_AFTER_DELETE – The amount of time that the system will wait after a node deletion before resuming scale down evaluations.
  • AUTOSCALER_SCALE_DOWN_DELAY_AFTER_FAILURE – The amount of time that the system will wait after a scale down failure before resuming scale down evaluations.
  • AUTOSCALER_SCALE_DOWN_UNNEEDED_TIME – The amount of time a node is deemed unnecessary before it is appropriate for the autoscaler to scale down the cluster and delete the node.
  • AUTOSCALER_MAX_NODE_PROVISION_TIME – The amount of time that the system will wait for a new node to be created during a scale up operation.
  • AUTOSCALER_MIN_SIZE_0 – The lowest number of Kubernetes worker nodes that the system will scale down to for a single availability zone (AZ). This value is for AZ 0 and the AUTOSCALER_MIN_SIZE_1 and AUTOSCALER_MIN_SIZE_2 values correspond to AZ 1 and AZ 2 respectively. The 1 and 2 values are currently only applicable to TKG clusters running on AWS as multi-AZ support is not yet available for vSphere or Azure. If you don’t set this value, it will default to the WORKER_MACHINE_COUNT_# value.
  • AUTOSCALER_MAX_SIZE_0 – The highest number of Kubernetes worker nodes that the system will scale up to for a single availability zone (AZ). This value is for AZ 0 and the AUTOSCALER_MAX_SIZE_1 and AUTOSCALER_MAX_SIZE_2 values correspond to AZ 1 and AZ 2 respectively. The 1 and 2 values are currently only applicable to TKG clusters running on AWS as multi-AZ support is not yet available for vSphere or Azure. If you don’t set this value, it will default to the WORKER_MACHINE_COUNT_# value.
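
All of these are read as environment variables (or from the .tkg config files, where the defaults above live) at cluster creation time. As a purely illustrative sketch, and not a recommendation, the scale-down timers and total-node cap could be overridden like this before running tkg create cluster:

export AUTOSCALER_SCALE_DOWN_DELAY_AFTER_ADD=5m
export AUTOSCALER_SCALE_DOWN_UNNEEDED_TIME=5m
export AUTOSCALER_MAX_NODES_TOTAL=10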

Two values worth paying special attention to are AUTOSCALER_MIN_SIZE_0 and AUTOSCALER_MAX_SIZE_0. If you don’t set either of these, they will both default to the WORKER_MACHINE_COUNT_0 value and the autoscaler will effectively do nothing, as the lower and upper bounds within which it can work are the same. To allow for scale up operations, you must at least set AUTOSCALER_MAX_SIZE_0 to something higher than the WORKER_MACHINE_COUNT_0 value. Leaving AUTOSCALER_MIN_SIZE_0 unset does not prohibit scale down operations, as those can still happen once the cluster has been scaled up. You could also set AUTOSCALER_MIN_SIZE_0 lower than the WORKER_MACHINE_COUNT_0 value to allow the AZ to shrink below its initial configuration if it was oversized.

With all of this in mind, I’m going to set some environment variables and then deploy my cluster.

export WORKER_MACHINE_COUNT=2
export AUTOSCALER_MIN_SIZE_0=1
export AUTOSCALER_MAX_SIZE_0=5
tkg create cluster tkg-wld-as -p dev --enable-cluster-options autoscaler --vsphere-controlplane-endpoint 192.168.110.103 -v 6

One of the first things that happens is that a new deployment is created in the default namespace of the management cluster.

kubectl get deployment

 NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
 tkg-wld-as-cluster-autoscaler   1/1     1            1           77s

If we do a describe on this deployment, we can see that it will be running the autoscaler image against the new cluster (output has been truncated).

kubectl describe deployment tkg-wld-as-cluster-autoscaler

Containers:
    tkg-wld-as-cluster-autoscaler:
     Image:      registry.tkg.vmware.run/cluster-autoscaler:v1.19.1_vmware.1
     Port:       <none>
     Host Port:  <none>
     Command:
       /cluster-autoscaler
     Args:
       --cloud-provider=clusterapi
       --v=4
       --clusterapi-cloud-config-authoritative
       --kubeconfig=/mnt/tkg-wld-as-kubeconfig/value
       --node-group-auto-discovery=clusterapi:clusterName=tkg-wld-as
       --scale-down-delay-after-add=10m
       --scale-down-delay-after-delete=10s
       --scale-down-delay-after-failure=3m
       --scale-down-unneeded-time=10m
       --max-node-provision-time=15m
       --max-nodes-total=0

You can do a describe on the machinedeployment and check the Annotations to validate that the autoscaler settings provided via environment variables were set.

kubectl describe machinedeployment tkg-wld-as-md-0

 Name:         tkg-wld-as-md-0
 Namespace:    default
 Labels:       cluster.x-k8s.io/cluster-name=tkg-wld-as
 Annotations:  cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: 5
               cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: 1
               machinedeployment.clusters.x-k8s.io/revision: 1

If you set the verbosity high, you’ll see that the cluster creation process briefly waits for the autoscaler to come online.

Waiting for cluster autoscaler to be available…
Waiting for resource tkg-wld-as-cluster-autoscaler of type *v1.Deployment to be up and running

Once the cluster is up and running, we can see that I’ve got one control plane node and two workers.

kubectl get nodes 

 NAME                               STATUS   ROLES    AGE     VERSION
 tkg-wld-as-control-plane-kkz2p     Ready    master   5m31s   v1.19.3+vmware.1
 tkg-wld-as-md-0-58588b7b5f-g29gc   Ready    <none>   4m6s    v1.19.3+vmware.1
 tkg-wld-as-md-0-58588b7b5f-wsvqp   Ready    <none>   4m4s    v1.19.3+vmware.1

By investigating the logs of the autoscaler pod in the management cluster, we can see that it has already decided that two worker nodes are more than is needed.
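
If you don’t have the pod name handy, something like the following (run against the management cluster’s default namespace, where the deployment lives) will surface it:

kubectl get pods | grep cluster-autoscaler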

kubectl logs tkg-wld-as-cluster-autoscaler-7678c4744b-rls67

 I1210 20:23:21.252087       1 clusterapi_controller.go:555] node "tkg-wld-as-md-0-58588b7b5f-wsvqp" is in nodegroup "MachineDeployment/default/tkg-wld-as-md-0"
 I1210 20:23:21.252266       1 clusterapi_controller.go:555] node "tkg-wld-as-md-0-58588b7b5f-g29gc" is in nodegroup "MachineDeployment/default/tkg-wld-as-md-0"
 I1210 20:23:21.252337       1 static_autoscaler.go:492] tkg-wld-as-md-0-58588b7b5f-g29gc is unneeded since 2020-12-10 20:16:51.178779067 +0000 UTC m=+116.454260415 duration 6m28.860466285s
 I1210 20:23:21.252383       1 static_autoscaler.go:492] tkg-wld-as-md-0-58588b7b5f-wsvqp is unneeded since 2020-12-10 20:16:02.161944324 +0000 UTC m=+67.437425653 duration 7m17.877301047s
 I1210 20:23:21.252421       1 static_autoscaler.go:503] Scale down status: unneededOnly=true lastScaleUpTime=2020-12-10 20:14:55.363937358 +0000 UTC m=+0.639418683 lastScaleDownDeleteTime=2020-12-10 20:14:55.363937433 +0000 UTC m=+0.639418761 lastScaleDownFailTime=2020-12-10 20:14:55.363937511 +0000 UTC m=+0.639418837 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=true
 I1210 20:23:21.446182       1 request.go:581] Throttling request took 193.349574ms, request: GET:https://100.64.0.1:443/apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machinedeployments/tkg-wld-as-md-0/scale
 I1210 20:23:21.450888       1 clusterapi_provider.go:70] discovered node group: MachineDeployment/default/tkg-wld-as-md-0 (min: 1, max: 5, replicas: 2)

Once the cluster has been up for 10 minutes with practically no load, we’ll see the autoscaler remove a node (output is truncated).

kubectl logs tkg-wld-as-cluster-autoscaler-7678c4744b-rls67

  I1210 20:26:15.046159       1 static_autoscaler.go:516] Starting scale down
  I1210 20:26:14.846856       1 scale_down.go:421] Node tkg-wld-as-md-0-58588b7b5f-wsvqp - cpu utilization 0.200000
  I1210 20:26:14.846923       1 scale_down.go:421] Node tkg-wld-as-md-0-58588b7b5f-g29gc - cpu utilization 0.200000
  I1210 20:26:14.847323       1 clusterapi_controller.go:555] node "tkg-wld-as-md-0-58588b7b5f-wsvqp" is in nodegroup "MachineDeployment/default/tkg-wld-as-md-0"
  I1210 20:26:15.037546       1 request.go:581] Throttling request took 189.955197ms, request: GET:https://100.64.0.1:443/apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machinedeployments/tkg-wld-as-md-0/scale
  I1210 20:26:15.044372       1 clusterapi_controller.go:555] node "tkg-wld-as-md-0-58588b7b5f-g29gc" is in nodegroup "MachineDeployment/default/tkg-wld-as-md-0"
  I1210 20:26:15.044422       1 cluster.go:148] Fast evaluation: tkg-wld-as-md-0-58588b7b5f-g29gc for removal
  I1210 20:26:15.044437       1 cluster.go:185] Fast evaluation: node tkg-wld-as-md-0-58588b7b5f-g29gc may be removed
  I1210 20:26:15.046076       1 static_autoscaler.go:492] tkg-wld-as-md-0-58588b7b5f-g29gc is unneeded since 2020-12-10 20:16:51.178779067 +0000 UTC m=+116.454260415 duration 9m22.657249955s
  I1210 20:26:15.046092       1 static_autoscaler.go:492] tkg-wld-as-md-0-58588b7b5f-wsvqp is unneeded since 2020-12-10 20:16:02.161944324 +0000 UTC m=+67.437425653 duration 10m11.674084717s
  I1210 20:26:15.658016       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"77fb9bcb-cfcc-4b83-8694-d76fa51744ae", APIVersion:"v1", ResourceVersion:"3843", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node tkg-wld-as-md-0-58588b7b5f-wsvqp
  I1210 20:26:17.058060       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"tkg-wld-as-md-0-58588b7b5f-wsvqp", UID:"f06ac578-6ab9-4910-b61c-62726509e52e", APIVersion:"v1", ResourceVersion:"3567", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' node removed by cluster autoscaler

We can see that the cluster has one fewer node than it did just a few minutes ago, since node tkg-wld-as-md-0-58588b7b5f-wsvqp was just removed.

kubectl get nodes 

 NAME                               STATUS   ROLES    AGE   VERSION
 tkg-wld-as-control-plane-kkz2p     Ready    master   15m   v1.19.3+vmware.1
 tkg-wld-as-md-0-58588b7b5f-g29gc   Ready    <none>   14m   v1.19.3+vmware.1

To see how the autoscaler will add more nodes to the cluster, we’ll have to create some load. I’ll do this by deploying the php-apache web server application noted in the Horizontal Pod Autoscaler Walkthrough. The only difference I’m making is that I’ve set the number of replicas to 3.
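
A minimal sketch of how that might look, assuming the php-apache manifest published with the Kubernetes HPA walkthrough (adjust the URL if the docs have moved it), with the resulting deployment then scaled out to 3 replicas:

kubectl apply -f https://k8s.io/examples/application/php-apache.yaml
kubectl scale deployment php-apache --replicas=3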

kubectl get po

 NAME                         READY   STATUS    RESTARTS   AGE
 php-apache-d4cf67d68-2w7wz   1/1     Running   0          112s
 php-apache-d4cf67d68-jxw9s   1/1     Running   0          112s
 php-apache-d4cf67d68-l78ff   1/1     Running   0          112s

These pods aren’t doing much, and there is still not enough load for the autoscaler to kick in and provision a new node. If I increase the number of replicas to 10, the autoscaler responds almost immediately by creating a new node.
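
Bumping the replica count is just a scale operation against the same deployment, for example:

kubectl scale deployment php-apache --replicas=10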

kubectl get po

 NAME                         READY   STATUS    RESTARTS   AGE
 php-apache-d4cf67d68-2267j   1/1     Running   0          2m44s
 php-apache-d4cf67d68-2w7wz   1/1     Running   0          20m
 php-apache-d4cf67d68-78bm6   1/1     Running   0          2m44s
 php-apache-d4cf67d68-b2vnh   1/1     Running   0          2m44s
 php-apache-d4cf67d68-j5fqs   1/1     Running   0          2m44s
 php-apache-d4cf67d68-jxw9s   1/1     Running   0          20m
 php-apache-d4cf67d68-l78ff   1/1     Running   0          20m
 php-apache-d4cf67d68-l7mmj   1/1     Running   0          2m44s
 php-apache-d4cf67d68-s29xc   1/1     Running   0          2m44s
 php-apache-d4cf67d68-w6fwb   1/1     Running   0          2m44s
kubectl --context=tkg-wld-as-admin@tkg-wld-as describe po php-apache-d4cf67d68-s29xc

Events:
   Type     Reason            Age                    From                Message
   ----     ------            ----                   ----                -------
   Warning  FailedScheduling  3m43s (x2 over 3m43s)  default-scheduler   0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
   Normal   TriggeredScaleUp  3m35s                  cluster-autoscaler  pod triggered scale-up: [{MachineDeployment/default/tkg-wld-as-md-0 1->2 (max: 5)}]
   Warning  FailedScheduling  2m24s (x4 over 2m54s)  default-scheduler   0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.
   Normal   Scheduled         2m14s                  default-scheduler   Successfully assigned default/php-apache-d4cf67d68-s29xc to tkg-wld-as-md-0-58588b7b5f-nfzqs

If we look at the nodes present in the cluster now, we’ll see that there is a new one named tkg-wld-as-md-0-58588b7b5f-nfzqs.

kubectl get nodes

 NAME                               STATUS   ROLES    AGE     VERSION
 tkg-wld-as-control-plane-kkz2p     Ready    master   68m     v1.19.3+vmware.1
 tkg-wld-as-md-0-58588b7b5f-g29gc   Ready    <none>   66m     v1.19.3+vmware.1
 tkg-wld-as-md-0-58588b7b5f-nfzqs   Ready    <none>   4m11s   v1.19.3+vmware.1

And looking in the autoscaler logs, we can see the details behind what triggered the new node to be created (output is truncated).

kubectl logs tkg-wld-as-cluster-autoscaler-7678c4744b-rls67

 I1210 21:17:43.586296       1 filter_out_schedulable.go:65] Filtering out schedulables
 I1210 21:17:43.586369       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
 I1210 21:17:43.781018       1 filter_out_schedulable.go:170] 1 pods were kept as unschedulable based on caching
 I1210 21:17:43.781069       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
 I1210 21:17:43.781092       1 filter_out_schedulable.go:82] No schedulable pods
 I1210 21:17:43.796755       1 klogx.go:86] Pod default/php-apache-d4cf67d68-2267j is unschedulable
 I1210 21:17:43.796817       1 klogx.go:86] Pod default/php-apache-d4cf67d68-s29xc is unschedulable
 I1210 21:17:44.412791       1 scale_up.go:456] Best option to resize: MachineDeployment/default/tkg-wld-as-md-0
 I1210 21:17:44.412850       1 scale_up.go:460] Estimated 1 nodes needed in MachineDeployment/default/tkg-wld-as-md-0
 I1210 21:17:44.588553       1 scale_up.go:574] Final scale-up plan: [{MachineDeployment/default/tkg-wld-as-md-0 1->2 (max: 5)}]
 I1210 21:17:44.593599       1 scale_up.go:663] Scale-up: setting group MachineDeployment/default/tkg-wld-as-md-0 size to 2
 I1210 21:17:44.653323       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"77fb9bcb-cfcc-4b83-8694-d76fa51744ae", APIVersion:"v1", ResourceVersion:"18369", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group MachineDeployment/default/tkg-wld-as-md-0 size to 2
 W1210 21:17:56.300406       1 clusterapi_controller.go:454] Machine "tkg-wld-as-md-0-58588b7b5f-nfzqs" has no providerID
 I1210 21:17:56.300445       1 clusterapi_controller.go:478] Status.NodeRef of machine "tkg-wld-as-md-0-58588b7b5f-nfzqs" is currently nil
 I1210 21:17:56.300452       1 clusterapi_controller.go:509] nodegroup tkg-wld-as-md-0 has nodes [vsphere://422af9c5-6bd9-121c-e616-ea6dc706cda5 vsphere://422a1fe4-dc84-3fa6-ab50-175437c2a967]

And very soon after, we can see that the next scale down evaluation does not find an opportunity, since the remaining worker node’s CPU utilization is now too high (output is truncated).

kubectl logs tkg-wld-as-cluster-autoscaler-7678c4744b-rls67

 I1210 21:18:53.910197       1 scale_down.go:421] Node tkg-wld-as-md-0-58588b7b5f-g29gc - cpu utilization 1.000000
 I1210 21:18:53.910207       1 scale_down.go:424] Node tkg-wld-as-md-0-58588b7b5f-g29gc is not suitable for removal - cpu utilization too big (1.000000)
