How to Allow for Multiple TKG Nodes to be Upgraded in Parallel

By default, when a Tanzu Kubernetes Grid cluster is upgraded, the nodes are upgraded one at a time. In my labs, I usually have only a single control plane node and at most two worker nodes, since they are very resource-constrained and not running any demanding workloads, so this isn't a concern I would normally have. However, if you are running a large cluster with dozens of nodes, waiting for each node to be upgraded (create new node, join it to the cluster, remove the old node) one at a time would be incredibly time-consuming.

TKG uses ClusterAPI to manage the instantiation and lifecycle of Kubernetes clusters. One important part of ClusterAPI is the MachineDeployment, which is used when creating a workload cluster and defines how operations like cluster upgrades and scaling behave. Within a MachineDeployment is a spec.strategy.rollingUpdate.maxSurge parameter that defines how many nodes can be upgraded in parallel. In TKG, this value is set to 1 by default but can be changed, either in a plan before you create a cluster or in an already-created cluster.
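
If you want to see what a given MachineDeployment is currently using before changing anything, a quick jsonpath query from the management cluster context pulls the field directly. This is just a sketch; the MachineDeployment name follows the <cluster-name>-md-0 pattern you'll see later, so substitute your own. If nothing comes back, the field isn't set on the object and the default of 1 applies.

kubectl get machinedeployments.cluster.x-k8s.io <cluster-name>-md-0 \
  -o jsonpath='{.spec.strategy.rollingUpdate.maxSurge}'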

Option 1: Modify a plan

The various plans that TKG uses when creating clusters are the YAML files under the ~/.tkg/providers/infrastructure-vsphere/v0.6.5 folder (Note: v0.6.5 is specific to TKG 1.1.2; you may see something different if you're on an older version). For this example, I'll be changing the dev plan, which corresponds to the cluster-template-dev.yaml file. If you open this file in a text editor, you'll see a section whose kind is MachineDeployment and it should look like the following:

---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: '${ CLUSTER_NAME }'
  name: '${ CLUSTER_NAME }-md-0'
  namespace: '${ NAMESPACE }'
spec:
  clusterName: '${ CLUSTER_NAME }'
  replicas: ${ WORKER_MACHINE_COUNT }
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: '${ CLUSTER_NAME }'
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: '${ CLUSTER_NAME }'
        node-pool: "${CLUSTER_NAME}-worker-pool"
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: '${ CLUSTER_NAME }-md-0'
      clusterName: '${ CLUSTER_NAME }'
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: VSphereMachineTemplate
        name: '${ CLUSTER_NAME }-worker'
      version: '${ KUBERNETES_VERSION }'
---
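
The replicas: ${ WORKER_MACHINE_COUNT } line in that section is where we'll anchor the change in a moment. If you want to jump straight to it in the file, a quick grep for the variable does the trick (assuming the default ~/.tkg location and the v0.6.5 folder mentioned above):

grep -n 'WORKER_MACHINE_COUNT' ~/.tkg/providers/infrastructure-vsphere/v0.6.5/cluster-template-dev.yaml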

The maxSurge value is not present here but it defaults to 1. To make the necessary change, we’ll add the following strategy stanza just after the replicas: ${ WORKER_MACHINE_COUNT } line:

  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate

This will allow for two nodes at a time to be upgraded.
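
As a side note, the ClusterAPI API reference describes both maxSurge and maxUnavailable as accepting either an absolute number or a percentage of desired machines, so for very large clusters a stanza like the following should also work. I haven't tested the percentage form in TKG, so treat it as a sketch.

  strategy:
    rollingUpdate:
      maxSurge: "20%"
      maxUnavailable: 0
    type: RollingUpdate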

After the cluster is created with the dev plan, you can do a describe on the MachineDeployment to see that the maxSurge value is set to 2.

kubectl describe machinedeployments.cluster.x-k8s.io vsphere-test-md-0 |grep "Max Surge"
      Max Surge:        2

To test this, I deployed my TKG 1.1.2 cluster at Kubernetes version 1.17.6 via the following command (you need to have the 1.17.6 OVA imported already):

tkg create cluster vsphere-test -p dev -c 1 -w 3 --kubernetes-version v1.17.6+vmware.1
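
Once the create finishes, you can grab the workload cluster's kubeconfig and confirm the node versions before upgrading. The commands below assume the TKG 1.1.x CLI syntax and that the admin context follows the usual <cluster>-admin@<cluster> pattern; adjust if your version differs.

tkg get credentials vsphere-test
kubectl config use-context vsphere-test-admin@vsphere-test
kubectl get nodes -o wide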

When the cluster was created, I tested the new maxSurge value via an upgrade to version v1.18.3+vmware.1 (this was as simple as tkg upgrade cluster vsphere-test since I already had the 1.18.3 OVA imported). While the upgrade was happening, it was easy enough to see two templates at a time getting cloned to new Kubernetes nodes.
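
Besides watching the clones in vSphere, you can also follow the rollout from the management cluster context; with maxSurge set to 2 you should see two new Machines in a provisioning state at the same time. A sketch, assuming the same cluster name as above:

kubectl get machines.cluster.x-k8s.io -w
# or, to watch the MachineDeployment replica counts change:
kubectl get machinedeployments.cluster.x-k8s.io vsphere-test-md-0 -w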

Option 2: Modify an existing cluster

If you have an older cluster already created and want to upgrade and take advantage of the ability to upgrade multiple nodes in parallel, you can edit the already-deployed MachineDeployment to allow for this.

Note: The MachineDeployments exist in the management cluster, so make sure your Kubernetes context is set correctly before starting.
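
If you're not sure which context points at the management cluster, plain kubectl is enough to check and switch (the context name below is just an example; yours will match your management cluster's name):

kubectl config get-contexts
kubectl config use-context tkg-mgmt-vsphere-admin@tkg-mgmt-vsphere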

You can do a kubectl edit against the existing MachineDeployment just as you would against other Kubernetes objects.

kubectl edit machinedeployments vsphere-test-md-0

You can just search for the word maxSurge and you will find the maxSurge: 1 line. Change the 1 to whatever value you want and then :wq to save and exit. The following is an abbreviated sample of mine after changing it:

spec:
  clusterName: vsphere-test
  minReadySeconds: 0
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: vsphere-test
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
    type: RollingUpdate

A describe against the same MachineDeployment will now show the updated value.

kubectl describe machinedeployments.cluster.x-k8s.io vsphere-test-md-0 |grep "Max Surge"
      Max Surge: 2
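
If you'd rather not open an editor at all, a one-line merge patch should accomplish the same change (a sketch; I've only verified the kubectl edit route above):

kubectl patch machinedeployments.cluster.x-k8s.io vsphere-test-md-0 \
  --type merge \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxSurge":2}}}}'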

And any upgrades against this cluster will now run with two nodes being upgraded at a time.

You can read more about TKG upgrades at Upgrading Tanzu Kubernetes Grid.
