Using multiple availability zones for a workload cluster in TKG 1.4 on vSphere

Before TKG 1.4, when you provisioned workload clusters on vSphere, the nodes mostly ended up randomly spread out across the available ESXi hosts. This wouldn’t provide the best experience when you’re planning for DR scenarios and want to ensure some redundancy. There is now the ability to spread nodes across multiple clusters within a single datacenter or across multiple host groups within a single cluster. Since my nested lab environment is incredibly small, I’m going through the option of placing nodes in multiple host groups. The configuration for this is not trivial, as you’ll see, but I will try to lay it out thoroughly and explain what’s being done along the way.

Note: This is an experimental feature in TKG 1.4 so you should not configure a production workload cluster in this fashion.

VMware has the process for both methods documented at Spread Nodes Across Multiple Compute Clusters in a Datacenter and Spread Nodes Across Multiple Hosts in a Single Compute Cluster.

Create VM/Host Groups and Affinity Rules

The very first thing to do is to create VM groups (for the nodes), host groups (where the nodes will run) and affinity rules (to keep the nodes in a VM group on the hosts in the matching host group). This is easily done in the vSphere UI from the cluster’s Configure > VM/Host Groups page.

Click the Add button to create a new group.

Give the group a name (HG1 in this example) and specify the group type (Host Group in this example).

Click the Add button to add ESXi hosts to the host group. I’m picking the first two hosts in my cluster for this group.

Click the OK button and you should see a summary of the new host group.

Click the OK button again. You can now see the host group and its members on the VM/Host Groups page.

Create any additional host groups that you will need (I only created one additional host group with the other two ESXi hosts in my cluster). The number of host groups should match the number of availability zones you plan to use.

Click the Add button again to start the process of creating your first VM group.

Give the VM group a name (VMG1 in this example) and set the Type to VM Group.

This is the point where I hit a small snag. I didn’t have any VMs that I wanted in this VM group as they had not been created yet. I tried clicking OK but was prevented from proceeding by the following error: A group must have members. Add a group member. I quickly created a dummy VM (dummy1) and then clicked the Add button here to add it to the group.

Click the OK button and you should see a summary of the new VM group.

Click the OK button again. You can now see the VM group and its member on the VM/Host Groups page.

Repeat this process for however many VM groups you need (I only created one additional VM group with the same dummy1 VM in it). The number of VM groups should match the number of host groups.

I did find out later that you can use govc to create empty VM groups. The syntax for the command would look similar to the following:

govc cluster.group.create -cluster=RegionA01-MGMT -name=VMG1 -vm

This could also be used to create any host groups by replacing -vm with -host and specifying the hosts to add, similar to the following:

govc cluster.group.create -cluster=RegionA01-MGMT -name=HG1 -host esx-01a.corp.tanzu esx-02a.corp.tanzu

I also tried this via PowerCLI but found that it has the same restriction: you need to specify a VM name when creating a VM group. You could still use PowerCLI to create the host groups since there are hosts to be added…the command would look similar to the following:

New-DrsClusterGroup -name "HG1" -cluster "RegionA01-MGMT" -vmhost "esx-01a.corp.tanzu", "esx-02a.corp.tanzu"

Moving on, affinity rules (VM/Host rules) were needed to pair up the VM groups with their appropriate host groups. You can start this process on the cluster’s Configure > VM/Host Rules page.

Click the Add button to create a new rule.

Give the rule a name (AZ1 in this example) and set the Type to Virtual Machines to Hosts. Set the VM group to the first VM group you created (VMG1 in this example), and the rule to Must run on hosts in group. Set the Host group to the first host group you created (HG1 in this example).

Click the OK button and you should now see the rule on the VM/Host Rules page.

Repeat this process for any additional rules you need to create (I only needed one more).
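If you would rather script the groups and rules than click through the UI for every zone, the repetition can be sketched in a few lines of Python. Treat this as a rough sketch and not something I ran in the lab: it only prints govc commands for review, the helper name az_commands is made up for illustration, esx-04a.corp.tanzu is an assumed host name for the fourth host, and the govc cluster.rule.create flags should be checked against govc cluster.rule.create -h for your govc version.

```python
# Sketch: build the govc commands for each availability zone's groups and rule.
# Nothing is executed here; the commands are only printed for review.

CLUSTER = "RegionA01-MGMT"

# Zone number -> ESXi hosts. esx-04a.corp.tanzu is an assumed name.
ZONES = {
    1: ["esx-01a.corp.tanzu", "esx-02a.corp.tanzu"],
    2: ["esx-03a.corp.tanzu", "esx-04a.corp.tanzu"],
}


def az_commands(cluster, zone, hosts):
    """Return govc commands for host group HG<n>, VM group VMG<n> and rule AZ<n>."""
    hg, vmg, rule = f"HG{zone}", f"VMG{zone}", f"AZ{zone}"
    return [
        # Host group containing the zone's ESXi hosts.
        f"govc cluster.group.create -cluster={cluster} -name={hg} -host " + " ".join(hosts),
        # Empty VM group; CAPV adds the node VMs to it later.
        f"govc cluster.group.create -cluster={cluster} -name={vmg} -vm",
        # "Must run on hosts in group" rule tying the VM group to the host group.
        f"govc cluster.rule.create -cluster={cluster} -name={rule} -enable -mandatory "
        f"-vm-host -vm-group {vmg} -host-affine-group {hg}",
    ]


if __name__ == "__main__":
    for zone, hosts in sorted(ZONES.items()):
        for cmd in az_commands(CLUSTER, zone, hosts):
            print(cmd)
```

Printing the commands first makes it easy to eyeball the group and rule names before anything touches vCenter.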

If your rules and groups are all okay, you can move on to creating and assigning tags.

Note: If you go the route of creating everything via the UI with a dummy VM in your VM groups, you can delete this VM (or just remove it from the VM groups) once your cluster is up and running.

Create and assign tags to your cluster and hosts.

vSphere tags are going to be used to identify the cluster and hosts where the Kubernetes nodes are placed. You can see that there are no tags assigned to my cluster currently (in the Tags pane on the cluster’s Summary page).

Click the Assign button in the Tags pane.

I’ve only got the one tag present (k8s-storage) that I’m using to identify my NFS datastore for inclusion in a particular storage policy. You can see this same tag from the Tags & Custom Attributes page:

Click the New link to create a new tag.

The first tag to create will go on the selected cluster and correlate to the concept of a region, so give it an appropriate name (lab in this example). You will likely need to create a new tag category as well. You can see in this screenshot that I already had one named k8s but this was also used for storage purposes.

Click the Create New Category link.

Give the category an appropriate name (k8s-region in this example). You can leave all other items as-is or leave only the appropriate Associable Object Type selected (Cluster in this example).

Click the Create button.

You should end up back on the Create Tag page where the newly created tag category is populated in the Category field.

Click the Create button here and you should be sent back to the Assign Tag page.

Select the newly created tag and click the Assign button.

You’ll now see the tag (lab in this example) in the Tags pane for the cluster.

This same process now needs to be repeated to create the availability zone tags which will be placed on the ESXi hosts.

Select the first ESXi host in the cluster that will be part of an availability zone (esx-01a.corp.tanzu in this example). You can see from the Tags pane on the Summary tab for this host that no tags are currently present.

Click on the Assign link to start the process of creating a new tag.

You can see the new lab tag that was just created as well as the original k8s-storage tag.

Click the Add Tag link.

This tag will be placed on the selected host and correlate to the concept of a zone, so give it an appropriate name (AZ1 in this example). As with the region tag, you will likely need to create a new tag category for your zone tags.

Click the Create New Category link.

Give the category an appropriate name (k8s-zone in this example). You can leave all other items as-is or leave only the appropriate Associable Object Type selected (Host in this example).

Click the Create button.

You should end up back on the Create Tag page where the newly created tag category is populated in the Category field.

Click the Create button here and you should be sent back to the Assign Tag page.

Select the newly created tag and click the Assign button.

You’ll now see the tag (AZ1 in this example) in the Tags pane for the host.

You’ll need to repeat this process in a limited fashion…add the same tag to other hosts in the same availability zone but create new tags for hosts in other availability zones. You do not need to create any more zone-based tag categories though…they will all fall under the first one you created (k8s-zone in this example).

You can validate what you’ve configured from the Tags and Custom Attributes page:

Clicking on any of the tags will let you drill down and see what objects have that tag on them:

As you might have suspected, you can complete these tasks from the command line as well…but there is a small caveat. Using the PowerCLI cmdlet New-TagCategory or the govc tags.category.create command will create the tag categories, but you have to manually specify the object types to which they apply. When you create tag categories in the UI, they apply to all objects by default. This is arguably overkill but does make it much harder to make a mistake with your tag category definition. When assigning a tag to a resource using govc, you also need to know the full inventory path to that resource (/RegionA01/host/RegionA01-MGMT/esx-01a.corp.tanzu for one of my hosts, as an example).

Create a region-based tag category, a tag under that category, and assign the tag to a cluster with govc:

govc tags.category.create -t ClusterComputeResource k8s-region
govc tags.create -c k8s-region lab
govc tags.attach lab /RegionA01/host/RegionA01-MGMT

Create a zone-based tag category, a tag under that category, and assign the tag to a host with govc:

govc tags.category.create -t HostSystem k8s-zone
govc tags.create -c k8s-zone AZ1
govc tags.attach AZ1 /RegionA01/host/RegionA01-MGMT/esx-01a.corp.tanzu
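Since every host in every zone needs a tag, and govc wants the full inventory path each time, the remaining commands are easy to generate. The following is a small Python sketch under a few assumptions: the k8s-zone category already exists, none of the zone tags have been created yet, the helper name zone_tag_commands is hypothetical, and esx-04a.corp.tanzu is an assumed name for my fourth host. It only builds the command strings and does not call govc.

```python
# Sketch: print the govc commands that tag every host with its zone tag.

INVENTORY_ROOT = "/RegionA01/host/RegionA01-MGMT"
ZONE_CATEGORY = "k8s-zone"

# Zone tag -> hosts in that zone. esx-04a.corp.tanzu is an assumed name.
ZONE_HOSTS = {
    "AZ1": ["esx-01a.corp.tanzu", "esx-02a.corp.tanzu"],
    "AZ2": ["esx-03a.corp.tanzu", "esx-04a.corp.tanzu"],
}


def zone_tag_commands(root, category, zone_hosts):
    """Create each zone tag once, then attach it to every host in that zone."""
    commands = []
    for tag, hosts in zone_hosts.items():
        commands.append(f"govc tags.create -c {category} {tag}")
        for host in hosts:
            # govc needs the full inventory path of the host.
            commands.append(f"govc tags.attach {tag} {root}/{host}")
    return commands


if __name__ == "__main__":
    for cmd in zone_tag_commands(INVENTORY_ROOT, ZONE_CATEGORY, ZONE_HOSTS):
        print(cmd)
```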

Create a region-based tag category, a tag under that category, and assign the tag to a cluster with PowerCLI:

New-TagCategory -Name "k8s-region" -EntityType "ClusterComputeResource"

Name                                     Cardinality Description
----                                     ----------- -----------
k8s-region                               Single

New-Tag -Category "k8s-region" -Name "lab"

Name                           Category                       Description
----                           --------                       -----------
lab                            k8s-region

Get-Cluster RegionA01-MGMT | New-TagAssignment -Tag lab

Tag                                      Entity
---                                      ------
k8s-region/lab                           RegionA01-MGMT

Create a zone-based tag category, a tag under that category, and assign the tag to a host with PowerCLI:

New-TagCategory -Name k8s-zone -EntityType "HostSystem"

Name                                     Cardinality Description
----                                     ----------- -----------
k8s-zone                                 Single

New-Tag -Category "k8s-zone" -Name "AZ1"

Name                           Category                       Description
----                           --------                       -----------
AZ1                            k8s-zone

Get-VMHost esx-01a.corp.tanzu | New-TagAssignment -Tag AZ1

Tag                                      Entity
---                                      ------
k8s-zone/AZ1                             esx-01a.corp.tanzu

Define your failure domains and deployment zones

The next step is to define your VSphereFailureDomain and VSphereDeploymentZone objects. These are custom resource definitions (CRDs) that exist as part of Cluster API on vSphere (CAPV). You can read a bit more about these objects at CAPV ControlPlane Failure Domain.

The following is a sample specification that defines two VSphereFailureDomain objects and two VSphereDeploymentZone objects. These will directly correspond to the two availability zones (AZ1 and AZ2) I defined via the tags and VM/Host groups earlier.

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: VSphereFailureDomain
metadata:
 name: az1
spec:
 region:
   name: lab
   type: ComputeCluster
   tagCategory: k8s-region
 zone:
   name: AZ1
   type: HostGroup
   tagCategory: k8s-zone
 topology:
   datacenter: RegionA01
   computeCluster: RegionA01-MGMT
   hosts:
     vmGroupName: VMG1
     hostGroupName: HG1
   datastore: map-vol
   networks:
   - K8s-Workload
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: VSphereFailureDomain
metadata:
 name: az2
spec:
 region:
   name: lab
   type: ComputeCluster
   tagCategory: k8s-region
 zone:
   name: AZ2
   type: HostGroup
   tagCategory: k8s-zone
 topology:
   datacenter: RegionA01
   computeCluster: RegionA01-MGMT
   hosts:
     vmGroupName: VMG2
     hostGroupName: HG2
   datastore: map-vol
   networks:
   - K8s-Workload
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: VSphereDeploymentZone
metadata:
 name: az1
spec:
 server: vcsa-01a.corp.tanzu
 failureDomain: az1
 placementConstraint:
   resourcePool: 
   folder: 
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: VSphereDeploymentZone
metadata:
 name: az2
spec:
 server: vcsa-01a.corp.tanzu
 failureDomain: az2
 placementConstraint:
   resourcePool: 
   folder: 

Some very important things to point out here:

  • For VSphereFailureDomain objects, everything under spec.region, spec.zone and spec.topology must match what you have configured in vCenter. In this case:

    spec.region.name = lab, which is the name of the region-based tag assigned to my cluster
    spec.region.type = ComputeCluster, which is saying that the region corresponds to a vSphere cluster
    spec.region.tagCategory = k8s-region, which is the tag category that the “lab” region-based tag falls under

    spec.zone.name = AZ1, which is the tag name for the hosts in the first host group created, which encompasses the esx-01a and esx-02a hosts in this example (the second VSphereFailureDomain uses AZ2 for this value)
    spec.zone.type = HostGroup, which is saying that the zone corresponds to a vSphere host group
    spec.zone.tagCategory = k8s-zone, which is the tag category that the “AZ1” and “AZ2” zone-based tags fall under

    spec.topology.datacenter = RegionA01, which is the vSphere datacenter object
    spec.topology.computeCluster = RegionA01-MGMT, which is the vSphere cluster object
    spec.topology.hosts.vmGroupName = VMG1, which is the first VM group created and aligns with the HG1 host group (the second VSphereFailureDomain uses VMG2 for this value)
    spec.topology.hosts.hostGroupName = HG1, which is the first host group created and aligns with VMG1 VM group (the second VSphereFailureDomain uses HG2 for this value)
    spec.topology.datastore = map-vol, this just needs to map to a datastore name where VMs on the specified hosts can reside
    spec.topology.networks = K8s-Workload, this just needs to map to a network which the VMs on the specified hosts can use

    Both the datastore and networks values should also align with the values you use when creating your workload clusters

  • For VSphereDeploymentZone objects, the spec.failureDomain value must match one of the metadata.name values of the VSphereFailureDomain definitions…i.e. the first VSphereDeploymentZone has a spec.failureDomain value of az1, which corresponds to the metadata.name value for the first VSphereFailureDomain. The same holds true for the second VSphereDeploymentZone, az2.
  • The spec.server value in the VSphereDeploymentZone objects must exactly match the vCenter Server address (IP or FQDN) as it was entered for the VCENTER SERVER value on the IaaS Provider page of the installer UI or the VSPHERE_SERVER parameter if you used a configuration file. If these do not match, the control plane nodes in your workload cluster will not be placed into availability zones.
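Both of these matching requirements are easy to get wrong with a typo, so a quick pre-flight check before running kubectl apply can save some head scratching. The following Python sketch is only an illustration: the check_zones function is hypothetical, the objects are written inline as dictionaries (trimmed to the fields being compared) instead of being read from the YAML file, and the comparison is a plain case-sensitive string match, which is my assumption of what "exactly match" needs to mean here.

```python
# Sketch: sanity-check zone definitions before running kubectl apply.

def check_zones(failure_domains, deployment_zones, vsphere_server):
    """Return a list of problems; an empty list means the references line up."""
    problems = []
    fd_names = {fd["metadata"]["name"] for fd in failure_domains}
    for dz in deployment_zones:
        name = dz["metadata"]["name"]
        spec = dz["spec"]
        # spec.failureDomain must be the metadata.name of a VSphereFailureDomain.
        if spec["failureDomain"] not in fd_names:
            problems.append(
                f"{name}: failureDomain {spec['failureDomain']!r} "
                "is not a VSphereFailureDomain name"
            )
        # spec.server must be identical to the VSPHERE_SERVER value.
        if spec["server"] != vsphere_server:
            problems.append(
                f"{name}: server {spec['server']!r} "
                f"does not match VSPHERE_SERVER {vsphere_server!r}"
            )
    return problems


failure_domains = [{"metadata": {"name": "az1"}}, {"metadata": {"name": "az2"}}]
deployment_zones = [
    {"metadata": {"name": "az1"},
     "spec": {"server": "vcsa-01a.corp.tanzu", "failureDomain": "az1"}},
    {"metadata": {"name": "az2"},
     "spec": {"server": "vcsa-01a.corp.tanzu", "failureDomain": "az2"}},
]

# Prints [] because every reference lines up.
print(check_zones(failure_domains, deployment_zones, "vcsa-01a.corp.tanzu"))
```

Passing an IP address as the third argument while the objects use the FQDN, or writing AZ1 where az1 is expected, would each add an entry to the returned list.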

When you’re happy with your VSphereFailureDomain and VSphereDeploymentZone specification file, you can apply it to create the objects:

kubectl apply -f vsphere-zones.yaml

vspherefailuredomain.infrastructure.cluster.x-k8s.io/az1 created
vspherefailuredomain.infrastructure.cluster.x-k8s.io/az2 created
vspheredeploymentzone.infrastructure.cluster.x-k8s.io/az1 created
vspheredeploymentzone.infrastructure.cluster.x-k8s.io/az2 created

You can inspect the objects to ensure that they are configured as desired:

kubectl describe vspherefailuredomain az1

Name:         az1
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  infrastructure.cluster.x-k8s.io/v1alpha3
Kind:         VSphereFailureDomain
Metadata:
  Creation Timestamp:  2021-10-11T20:15:37Z
...
  Resource Version:  55786
  UID:               5eeffe54-81fa-4a67-bd44-f796ba891826
Spec:
  Region:
    Auto Configure:  false
    Name:            lab
    Tag Category:    k8s-region
    Type:            ComputeCluster
  Topology:
    Compute Cluster:  RegionA01-MGMT
    Datacenter:       RegionA01
    Datastore:        map-vol
    Hosts:
      Host Group Name:  HG1
      Vm Group Name:    VMG1
    Networks:
      K8s-Workload
  Zone:
    Auto Configure:  false
    Name:            AZ1
    Tag Category:    k8s-zone
    Type:            HostGroup
Events:              <none>

kubectl describe vspheredeploymentzone az1

Name:         az1
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  infrastructure.cluster.x-k8s.io/v1alpha3
Kind:         VSphereDeploymentZone
Metadata:
  Creation Timestamp:  2021-10-11T20:10:22Z
...
  Resource Version:  53932
  UID:               a072f2f2-057a-4ae2-b374-b9321517cf2d
Spec:
  Control Plane:   true
  Failure Domain:  az1
  Placement Constraint:
    Folder:         
    Resource Pool:  
  Server:           vcsa-01a.corp.tanzu
Events:             <none>

Add an overlay to modify the MachineDeployment, VSphereMachineTemplate and KubeadmConfigTemplate objects.

By default, when you deploy a TKG 1.4 workload cluster, you’ll get a MachineDeployment, which defines the configuration for the Machines (since this is vSphere, these are VMs). You’ll also get a VSphereMachineTemplate, which defines the VSphereMachine objects and is referenced in the spec.template.spec.infrastructureRef section of the MachineDeployment object, and a KubeadmConfigTemplate, which defines how kubeadm should build out the worker nodes and join them to the cluster. There are many other objects of course, but these are the ones we’re worried about right now. These will all need to be modified such that they will work with our newly defined VSphereFailureDomain and VSphereDeploymentZone objects.

You will need to add content similar to the following to the ~/.config/tanzu/tkg/providers/infrastructure-vsphere/ytt/vsphere-overlay.yaml file.

#! Please add any overlays specific to vSphere provider under this file.
#@ load("@ytt:overlay", "overlay")
#@ load("@ytt:data", "data")

#@overlay/match by=overlay.subset({"kind":"MachineDeployment", "metadata":{"name": "{}-md-0".format(data.values.CLUSTER_NAME)}})
---
spec:
 template:
   spec:
     #@overlay/match missing_ok=True
     failureDomain: az1
     infrastructureRef:
       name: #@ "{}-worker-0".format(data.values.CLUSTER_NAME)  

---

#@overlay/match by=overlay.subset({"kind":"VSphereMachineTemplate", "metadata":{"name": "{}-worker".format(data.values.CLUSTER_NAME)}})
---
metadata:
 name: #@ "{}-worker-0".format(data.values.CLUSTER_NAME)
spec:
 template:
   spec:
     #@overlay/match missing_ok=True
     failureDomain: az1

---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: VSphereMachineTemplate
metadata:
 name: #@ "{}-worker-1".format(data.values.CLUSTER_NAME)
spec:
 template:
   spec:
     cloneMode:  #@ data.values.VSPHERE_CLONE_MODE
     datacenter: #@ data.values.VSPHERE_DATACENTER
     datastore: #@ data.values.VSPHERE_DATASTORE
     storagePolicyName: #@ data.values.VSPHERE_STORAGE_POLICY_ID
     diskGiB: #@ data.values.VSPHERE_WORKER_DISK_GIB
     folder: #@ data.values.VSPHERE_FOLDER
     memoryMiB: #@ data.values.VSPHERE_WORKER_MEM_MIB
     network:
       devices:
       #@ if data.values.TKG_IP_FAMILY == "ipv6":
       #@overlay/match by=overlay.index(0)
       #@overlay/replace
       - dhcp6: true
         networkName: #@ data.values.VSPHERE_NETWORK
       #@ else:
       #@overlay/match by=overlay.index(0)
       #@overlay/replace
       - dhcp4: true
         networkName: #@ data.values.VSPHERE_NETWORK
       #@ end
     numCPUs: #@ data.values.VSPHERE_WORKER_NUM_CPUS
     resourcePool: #@ data.values.VSPHERE_RESOURCE_POOL
     server: #@ data.values.VSPHERE_SERVER
     template: #@ data.values.VSPHERE_TEMPLATE
     failureDomain: az2

---
apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
kind: KubeadmConfigTemplate
metadata:
 name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
 namespace: '${ NAMESPACE }'
spec:
 template:
   spec:
     useExperimentalRetryJoin: true
     joinConfiguration:
       nodeRegistration:
         criSocket: /var/run/containerd/containerd.sock
         kubeletExtraArgs:
           cloud-provider: external
           tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
         name: '{{ ds.meta_data.hostname }}'
     preKubeadmCommands:
     - hostname "{{ ds.meta_data.hostname }}"
     - echo "::1         ipv6-localhost ipv6-loopback" >/etc/hosts
     - echo "127.0.0.1   localhost" >>/etc/hosts
     - echo "127.0.0.1   {{ ds.meta_data.hostname }}" >>/etc/hosts
     - echo "{{ ds.meta_data.hostname }}" >/etc/hostname
     files: []
     users:
     - name: capv
       sshAuthorizedKeys:
       - '${ VSPHERE_SSH_AUTHORIZED_KEY }'
       sudo: ALL=(ALL) NOPASSWD:ALL

---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
 labels:
   cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
 name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
spec:
 clusterName: #@ data.values.CLUSTER_NAME
 replicas: #@ data.values.WORKER_MACHINE_COUNT
 selector:
   matchLabels:
     cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
 template:
   metadata:
     labels:
       cluster.x-k8s.io/cluster-name: #@ data.values.CLUSTER_NAME
       node-pool: #@ "{}-worker-pool".format(data.values.CLUSTER_NAME)
   spec:
     bootstrap:
       configRef:
         apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
         kind: KubeadmConfigTemplate
         name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
     clusterName: #@ data.values.CLUSTER_NAME
     infrastructureRef:
       apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
       kind: VSphereMachineTemplate
       name: #@ "{}-worker-1".format(data.values.CLUSTER_NAME)
     version: #@ data.values.KUBERNETES_VERSION
     failureDomain: az2

#@overlay/match by=overlay.subset({"kind":"KubeadmConfigTemplate"}), expects="1+"
---
spec:
 template:
   spec:
     users:
     #@overlay/match by=overlay.index(0)
     #@overlay/replace
     - name: capv
       sshAuthorizedKeys:
       - #@ data.values.VSPHERE_SSH_AUTHORIZED_KEY
       sudo: ALL=(ALL) NOPASSWD:ALL

To explain what’s going on here…

#@overlay/match by=overlay.subset({"kind":"MachineDeployment", "metadata":{"name": "{}-md-0".format(data.values.CLUSTER_NAME)}})
---
spec:
 template:
   spec:
     #@overlay/match missing_ok=True
     failureDomain: az1
     infrastructureRef:
       name: #@ "{}-worker-0".format(data.values.CLUSTER_NAME)

This stanza takes the default MachineDeployment, which is named <cluster-name>-md-0, adds the spec.template.spec.failureDomain: az1 value and changes its infrastructureRef name to <cluster-name>-worker-0. The MachineDeployment itself keeps its name; the new reference is there to align with the renamed VSphereMachineTemplate.

#@overlay/match by=overlay.subset({"kind":"VSphereMachineTemplate", "metadata":{"name": "{}-worker".format(data.values.CLUSTER_NAME)}})
---
metadata:
 name: #@ "{}-worker-0".format(data.values.CLUSTER_NAME)
spec:
 template:
   spec:
     #@overlay/match missing_ok=True
     failureDomain: az1

This stanza takes the default VSphereMachineTemplate, which is named <cluster-name>-worker, renames it to <cluster-name>-worker-0 (since we are going to have more than one now) and adds the spec.template.spec.failureDomain: az1 value.

The next three stanzas create additional VSphereMachineTemplate, KubeadmConfigTemplate and MachineDeployment objects named <cluster-name>-worker-1, <cluster-name>-md-1 and <cluster-name>-md-1 respectively. The reason for this is that we need to assign a set of nodes to the second availability zone, az2, so we need separate configuration definitions for each zone. If you were to make use of a third availability zone, you would need to create another set of these three stanzas, changing the names to end in -2 and setting the availability zone to az3.
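The naming pattern is the part that is easiest to get wrong when adding zones, since the object names count from 0 while the zones count from 1. As a quick illustration (the function name zone_object_names is made up, and this only reflects the naming used in my overlay, not anything TKG enforces), the pattern can be written out in Python:

```python
# Sketch: the names the overlay uses for the zone at a given position.

def zone_object_names(cluster_name, index):
    """Return object names for the zone at 0-based position `index`."""
    return {
        "VSphereMachineTemplate": f"{cluster_name}-worker-{index}",
        "KubeadmConfigTemplate": f"{cluster_name}-md-{index}",
        "MachineDeployment": f"{cluster_name}-md-{index}",
        # Zones are numbered from 1, object names from 0.
        "failureDomain": f"az{index + 1}",
    }


# A third availability zone for the tkg-wld cluster.
for kind, name in zone_object_names("tkg-wld", 2).items():
    print(f"{kind}: {name}")
```

For a third zone this gives tkg-wld-worker-2, tkg-wld-md-2 (twice) and az3.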

The very last stanza is configuring sudo and the ssh authorized key for the capv user on the worker nodes and does not need to be modified or duplicated.

Create a workload cluster

You should be at a point where you can create a specification for a workload cluster and apply it. The following is what I have used and you can see that there isn’t a lot there.

DEPLOY_TKG_ON_VSPHERE7: true
CLUSTER_CIDR: 100.96.0.0/11
SERVICE_CIDR: 100.64.0.0/13
CLUSTER_NAME: tkg-wld
CLUSTER_PLAN: prod
IDENTITY_MANAGEMENT_TYPE: ldap
INFRASTRUCTURE_PROVIDER: vsphere
NAMESPACE: default
CNI: antrea
ENABLE_MHC: "false"
MHC_UNKNOWN_STATUS_TIMEOUT: 5m
MHC_FALSE_STATUS_TIMEOUT: 5m
OS_NAME: ubuntu
VSPHERE_CONTROL_PLANE_DISK_GIB: "20"
VSPHERE_CONTROL_PLANE_ENDPOINT: 192.168.220.129
VSPHERE_CONTROL_PLANE_MEM_MIB: "4096"
VSPHERE_CONTROL_PLANE_NUM_CPUS: "2"
VSPHERE_DATACENTER: /RegionA01
VSPHERE_DATASTORE: /RegionA01/datastore/map-vol
VSPHERE_FOLDER: /RegionA01/vm
VSPHERE_NETWORK: K8s-Workload
VSPHERE_PASSWORD: <encoded:Vk13YXJlMSE=>
VSPHERE_RESOURCE_POOL: /RegionA01/host/RegionA01-MGMT/Resources
VSPHERE_STORAGE_POLICY_ID: k8s-policy
VSPHERE_SERVER: vcsa-01a.corp.tanzu
VSPHERE_SSH_AUTHORIZED_KEY: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC5KYNeWQgVHrDHaEhBCLF1vIR0OAtUIJwjKYkY4E/5HhEu8fPFvBOIHPFTPrtkX4vzSiMFKE5WheKGQIpW3HHlRbmRPc9oe6nNKlsUfFAaJ7OKF146Gjpb7lWs/C34mjdtxSb1D/YcHSyqK5mxhyHAXPeK8lrxG5MLOJ3X2A3iUvXcBo1NdhRdLRWQmyjs16fnPx6840x9n5NqeiukFYIVhDMFErq42AkeewsWcbZQuwViSLk2cIc09eykAjaXMojCmSbjrj0kC3sbYX+HD2OWbKohTqqO6/UABtjYgTjIS4PqsXWk63dFdcxF6ukuO6ZHaiY7h3xX2rTg9pv1oT8WBR44TYgvyRp0Bhe0u2/n/PUTRfp22cOWTA2wG955g7jOd7RVGhtMHi9gFXeUS2KodO6C4XEXC7Y2qp9p9ARlNvu11QoaDyH3l0h57Me9we+3XQNuteV69TYrJnlgWecMa/x+rcaEkgr7LD61dY9sTuufttLBP2ro4EIWoBY6F1Ozvcp8lcgi/55uUGxwiKDA6gQ+UA/xtrKk60s6MvYMzOxJiUQbWYr3MJ3NSz6PJVXMvlsAac6U+vX4U9eJP6/C1YDyBaiT96cb/B9TkvpLrhPwqMZdYVomVHsdY7YriJB93MRinKaDJor1aIE/HMsMpbgFCNA7mma9x5HS/57Imw==
    admin@corp.local
VSPHERE_TLS_THUMBPRINT: 01:8D:8B:7F:13:3A:B9:C6:90:D2:5F:17:AD:EB:AC:78:26:3C:45:FB
VSPHERE_USERNAME: administrator@vsphere.local
VSPHERE_WORKER_DISK_GIB: "20"
VSPHERE_WORKER_MEM_MIB: "8192"
VSPHERE_WORKER_NUM_CPUS: "8"
VSPHERE_REGION: k8s-region
VSPHERE_ZONE: k8s-zone
ENABLE_AUDIT_LOGGING: false
ENABLE_DEFAULT_STORAGE_CLASS: true
WORKER_MACHINE_COUNT: 2
AVI_CONTROL_PLANE_HA_PROVIDER: "true"

The VSPHERE_REGION must be set to the region-based tag category (k8s-region in this example) and the VSPHERE_ZONE must be set to the zone-based tag category (k8s-zone in this example).

It’s very important to note that the WORKER_MACHINE_COUNT value is per availability zone. By specifying a count of 2 for this setting, I will end up with four total worker nodes since I have two availability zones.
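To make the arithmetic concrete, here is a tiny Python sketch (the helper name workers_per_deployment is hypothetical). It assumes one MachineDeployment per availability zone, named the way the overlay above names them, each getting WORKER_MACHINE_COUNT replicas:

```python
# Sketch: worker replicas per MachineDeployment and the resulting total.

def workers_per_deployment(cluster_name, worker_machine_count, zones):
    """Each zone's MachineDeployment gets the full WORKER_MACHINE_COUNT."""
    return {
        f"{cluster_name}-md-{i}": worker_machine_count for i in range(zones)
    }


layout = workers_per_deployment("tkg-wld", 2, 2)
print(layout)                # {'tkg-wld-md-0': 2, 'tkg-wld-md-1': 2}
print(sum(layout.values()))  # 4
```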

One other thing to be aware of, though not directly related to the topic of availability zones, is that I have set AVI_CONTROL_PLANE_HA_PROVIDER to true and have specified a VSPHERE_CONTROL_PLANE_ENDPOINT value (192.168.220.129). This means that an IP has to be reserved in NSX Advanced Load Balancer (NSX ALB). You can read a little more about this topic in my earlier post, Migrating a TKG cluster control-plane endpoint from kube-vip to NSX-ALB.

From the Infrastructure, Networks page in the NSX ALB UI, you’ll need to click the Edit icon for the network where you want your control plane endpoint IP address to live (K8s-Frontend in my case).

Click the Edit button next to the appropriate network (192.168.220.0/23 in this case).

Click the Add Static IP Address Pool button and enter the desired control plane endpoint IP address as a range (192.168.220.129-192.168.220.129 in this example).

Click the Save button.

Click the Save button again. You should see a summary of the networks and can see that there is now an additional Static IP Pool configured on the desired network (K8s-Frontend has three now).
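A single-address range is an easy place for a typo to sneak in, so it can be worth validating the address before pasting it into the UI. This is a minimal Python sketch using only the standard ipaddress module; the function name single_ip_pool is made up for illustration, and the subnet is the 192.168.220.0/23 network from my lab.

```python
# Sketch: build a one-address pool range and confirm it sits inside the subnet.
import ipaddress


def single_ip_pool(endpoint, subnet):
    """Return 'IP-IP' for the endpoint, or raise ValueError if it is unusable."""
    ip = ipaddress.ip_address(endpoint)  # raises ValueError on a malformed address
    if ip not in ipaddress.ip_network(subnet):
        raise ValueError(f"{ip} is not inside {subnet}")
    return f"{ip}-{ip}"


print(single_ip_pool("192.168.220.129", "192.168.220.0/23"))
```

A malformed address or one outside the subnet raises ValueError instead of producing a range.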

Before you can issue the command to create the workload cluster you’ll need to know what available Kubernetes versions you have:

kubectl get tkr

NAME                              VERSION                         COMPATIBLE   CREATED
v1.17.16---vmware.2-tkg.1         v1.17.16+vmware.2-tkg.1         False        19h
v1.17.16---vmware.2-tkg.2         v1.17.16+vmware.2-tkg.2         False        19h
v1.17.16---vmware.3-tkg.1         v1.17.16+vmware.3-tkg.1         False        19h
v1.18.16---vmware.1-tkg.1         v1.18.16+vmware.1-tkg.1         False        19h
v1.18.16---vmware.1-tkg.2         v1.18.16+vmware.1-tkg.2         False        19h
v1.18.16---vmware.3-tkg.1         v1.18.16+vmware.3-tkg.1         False        19h
v1.18.17---vmware.1-tkg.1         v1.18.17+vmware.1-tkg.1         False        19h
v1.18.17---vmware.2-tkg.1         v1.18.17+vmware.2-tkg.1         False        19h
v1.19.12---vmware.1-tkg.1         v1.19.12+vmware.1-tkg.1         True         19h
v1.19.8---vmware.1-tkg.1          v1.19.8+vmware.1-tkg.1          False        19h
v1.19.8---vmware.1-tkg.2          v1.19.8+vmware.1-tkg.2          False        19h
v1.19.8---vmware.3-tkg.1          v1.19.8+vmware.3-tkg.1          False        19h
v1.19.9---vmware.1-tkg.1          v1.19.9+vmware.1-tkg.1          False        19h
v1.19.9---vmware.2-tkg.1          v1.19.9+vmware.2-tkg.1          False        19h
v1.20.4---vmware.1-tkg.1          v1.20.4+vmware.1-tkg.1          False        19h
v1.20.4---vmware.1-tkg.2          v1.20.4+vmware.1-tkg.2          False        19h
v1.20.4---vmware.3-tkg.1          v1.20.4+vmware.3-tkg.1          False        19h
v1.20.5---vmware.1-tkg.1          v1.20.5+vmware.1-tkg.1          False        19h
v1.20.5---vmware.2-fips.1-tkg.1   v1.20.5+vmware.2-fips.1-tkg.1   False        19h
v1.20.5---vmware.2-tkg.1          v1.20.5+vmware.2-tkg.1          False        19h
v1.20.8---vmware.1-tkg.2          v1.20.8+vmware.1-tkg.2          True         19h
v1.21.2---vmware.1-tkg.1          v1.21.2+vmware.1-tkg.1          True         19h
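Only a few of these are marked as compatible, and they are easy to miss in a long list. As a rough sketch, the following Python filters the table down to the compatible names. The function name compatible_tkrs is hypothetical, the sample text is just three rows copied from the output above, and it assumes the column layout shown there (NAME, VERSION, COMPATIBLE, CREATED).

```python
# Sketch: pick the compatible TanzuKubernetesRelease names out of the table.

def compatible_tkrs(table):
    """Return the NAME column for rows whose COMPATIBLE column is True."""
    names = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "True":
            names.append(fields[0])
    return names


sample = """
NAME                              VERSION                         COMPATIBLE   CREATED
v1.19.12---vmware.1-tkg.1         v1.19.12+vmware.1-tkg.1         True         19h
v1.20.5---vmware.2-tkg.1          v1.20.5+vmware.2-tkg.1          False        19h
v1.21.2---vmware.1-tkg.1          v1.21.2+vmware.1-tkg.1          True         19h
"""

print(compatible_tkrs(sample))
```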

The node image I deployed is v1.21.2 so I’ll be using v1.21.2---vmware.1-tkg.1.

tanzu cluster create -f tkg-wld-cluster.yaml --tkr v1.21.2---vmware.1-tkg.1 -v 6

compatibility file (/home/ubuntu/.config/tanzu/tkg/compatibility/tkg-compatibility.yaml) already exists, skipping download
BOM files inside /home/ubuntu/.config/tanzu/tkg/bom already exists, skipping download
Using namespace from config:
Validating configuration...
Waiting for resource pinniped-info of type *v1.ConfigMap to be up and running
Creating workload cluster 'tkg-wld'...
patch cluster object with operation status:
        {
                "metadata": {
                        "annotations": {
                                "TKGOperationInfo" : "{\"Operation\":\"Create\",\"OperationStartTimestamp\":\"2021-10-12 13:06:28.927801225 +0000 UTC\",\"OperationTimeout\":1800}",
                                "TKGOperationLastObservedTimestamp" : "2021-10-12 13:06:28.927801225 +0000 UTC"
                        }
                }
        }
Waiting for cluster to be initialized...
zero or multiple KCP objects found for the given cluster, 0 tkg-wld default, retrying
[cluster control plane is still being initialized, cluster infrastructure is still being provisioned], retrying

Right away, you’ll see a new virtual service created in NSX ALB for the workload cluster’s control plane endpoint (192.168.220.129):

It’s not in a functional state as there is really nothing backing it yet since the control plane node is not up and running.

And you can see this as a Kubernetes service from the management cluster context:

kubectl get svc

NAME                            TYPE           CLUSTER-IP      EXTERNAL-IP       PORT(S)          AGE
default-tkg-wld-control-plane   LoadBalancer   100.64.12.145   192.168.220.129   6443:30498/TCP   3h21m
kubernetes                      ClusterIP      100.64.0.1      <none>            443/TCP          24h

The first control plane VM should be deployed and powered on fairly quickly.

Once kubeadm has had a chance to run on this node and configure the requisite Kubernetes processes, the new virtual service in NSX ALB should move to a healthy state.
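
If you would rather wait for this from a terminal than watch the NSX ALB UI, a small polling loop can do the job. The following is only a sketch: retry_until is a hypothetical helper (not part of the tanzu or kubectl CLIs), and the commented usage assumes the API server allows unauthenticated requests to /healthz, which is the upstream Kubernetes default but may be locked down in your environment.

```shell
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example usage (not executed here). The IP is the control plane endpoint
# from this post; -k skips certificate verification, -f fails on HTTP errors.
# retry_until 60 10 curl -ksf https://192.168.220.129:6443/healthz
```

The helper takes the command as arguments rather than hard-coding curl, so the same loop can wrap any other readiness check.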

Once the first control plane node is up and running you will see a lot more activity in the vSphere Client as additional control plane and worker nodes are created:

Once the second control plane node is functional you should see the virtual service in NSX ALB updated to show both control plane nodes:

Per the earlier statement around having a total of four worker nodes, you can see that in the vSphere Client now:

You might also notice that they are spread across the hosts as desired: the first two worker nodes are on host esx-01a.corp.tanzu, which is in AZ1, and the second two are on esx-03a.corp.tanzu, which is in AZ2.

If you really want to keep a close eye on things, you can ssh to one of the control plane nodes (as the capv user, with the ssh key specified in the cluster configuration) and run kubectl against the local admin kubeconfig:

kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes

NAME                            STATUS   ROLES                  AGE   VERSION
tkg-wld-control-plane-cxh9l     Ready    control-plane,master   29m   v1.21.2+vmware.1
tkg-wld-control-plane-rb2nj     Ready    control-plane,master   13m   v1.21.2+vmware.1
tkg-wld-md-0-847b7d8cf9-26qff   Ready    <none>                 16m   v1.21.2+vmware.1
tkg-wld-md-0-847b7d8cf9-pksnq   Ready    <none>                 14m   v1.21.2+vmware.1
tkg-wld-md-1-85cc664768-llqkt   Ready    <none>                 13m   v1.21.2+vmware.1
tkg-wld-md-1-85cc664768-zpzjr   Ready    <none>                 22m   v1.21.2+vmware.1

At this point, two of the control plane nodes were functional, as were all four worker nodes.
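
If you are checking repeatedly, it can be easier to reduce that output to a one-line summary. This is a rough sketch that assumes the default kubectl get nodes column layout (STATUS in the second column, ROLES in the third); count_ready_by_role is a name I made up for illustration.

```shell
# count_ready_by_role: reads `kubectl get nodes` output on stdin and prints
# how many Ready nodes are control plane nodes and how many are workers.
# Hypothetical helper; assumes STATUS is column 2 and ROLES is column 3.
count_ready_by_role() {
  awk '
    $2 == "Ready" {
      if ($3 ~ /control-plane/) cp++
      else workers++
    }
    END { printf "control-plane=%d workers=%d\n", cp, workers }
  '
}

# Example usage on a control plane node (not executed here):
# kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes | count_ready_by_role
```

The header row is ignored automatically since its second field is STATUS rather than Ready, and nodes in a NotReady state are not counted.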

The third control plane node was online shortly afterwards:

And the virtual service in NSX ALB finally showed all three control plane nodes:

The cluster recognized all of the control plane nodes as well:

kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes

NAME                            STATUS   ROLES                  AGE    VERSION
tkg-wld-control-plane-cxh9l     Ready    control-plane,master   43m    v1.21.2+vmware.1
tkg-wld-control-plane-lhw6j     Ready    control-plane,master   2m7s   v1.21.2+vmware.1
tkg-wld-control-plane-rb2nj     Ready    control-plane,master   27m    v1.21.2+vmware.1
tkg-wld-md-0-847b7d8cf9-26qff   Ready    <none>                 30m    v1.21.2+vmware.1
tkg-wld-md-0-847b7d8cf9-pksnq   Ready    <none>                 28m    v1.21.2+vmware.1
tkg-wld-md-1-85cc664768-llqkt   Ready    <none>                 27m    v1.21.2+vmware.1
tkg-wld-md-1-85cc664768-zpzjr   Ready    <none>                 36m    v1.21.2+vmware.1

And you can see all three of the VMs in the vSphere Client:

And again, you can see that the nodes are spread out appropriately: the first node is on esx-01a.corp.tanzu, which is in AZ1, and the other two are on esx-03a.corp.tanzu and esx-04a.corp.tanzu, which are in AZ2.

Back at the command line where the tanzu cluster create command was run, the following was the remainder of the output:

cluster control plane is still being initialized, retrying
Getting secret for cluster
Waiting for resource tkg-wld-kubeconfig of type *v1.Secret to be up and running
Waiting for cluster nodes to be available...
Waiting for resource tkg-wld of type *v1alpha3.Cluster to be up and running
Waiting for resources type *v1alpha3.MachineDeploymentList to be up and running
Waiting for resources type *v1alpha3.MachineList to be up and running
Waiting for addons installation...
Waiting for resources type *v1alpha3.ClusterResourceSetList to be up and running
Waiting for resource antrea-controller of type *v1.Deployment to be up and running
Waiting for packages to be up and running...
Waiting for package: antrea
Waiting for package: load-balancer-and-ingress-service
Waiting for package: metrics-server
Waiting for package: pinniped
Waiting for package: vsphere-cpi
Waiting for package: vsphere-csi
Waiting for resource vsphere-csi of type *v1alpha1.PackageInstall to be up and running
Waiting for resource vsphere-cpi of type *v1alpha1.PackageInstall to be up and running
Waiting for resource metrics-server of type *v1alpha1.PackageInstall to be up and running
Waiting for resource load-balancer-and-ingress-service of type *v1alpha1.PackageInstall to be up and running
Waiting for resource pinniped of type *v1alpha1.PackageInstall to be up and running
Waiting for resource antrea of type *v1alpha1.PackageInstall to be up and running
Successfully reconciled package: vsphere-csi
Successfully reconciled package: metrics-server
Successfully reconciled package: antrea
Successfully reconciled package: vsphere-cpi
Successfully reconciled package: pinniped
packageinstalls.packaging.carvel.dev "load-balancer-and-ingress-service" not found, retrying
waiting for 'load-balancer-and-ingress-service' Package to be installed, retrying
waiting for 'load-balancer-and-ingress-service' Package to be installed, retrying
waiting for 'load-balancer-and-ingress-service' Package to be installed, retrying
waiting for 'load-balancer-and-ingress-service' Package to be installed, retrying
Successfully reconciled package: load-balancer-and-ingress-service

Workload cluster 'tkg-wld' created

Inspect the cluster and validate node placement

You can use the tanzu cluster list command to see a high-level view of the installed clusters:

tanzu cluster list --include-management-cluster

  NAME      NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES       PLAN
  tkg-wld   default     running  3/3           4/4      v1.21.2+vmware.1  <none>      prod
  tkg-mgmt  tkg-system  running  1/1           1/1      v1.21.2+vmware.1  management  dev
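
If you want to script this check, the listing can be filtered down to clusters that are not fully up. The sketch below assumes the column order shown above (STATUS third, CONTROLPLANE fourth, WORKERS fifth) and a single header row; clusters_not_ready is a hypothetical helper name, not a tanzu subcommand.

```shell
# clusters_not_ready: reads `tanzu cluster list` output on stdin and prints the
# name of every cluster that is not "running" or has fewer ready replicas than
# desired. Hypothetical helper; assumes the column order shown in this post.
clusters_not_ready() {
  awk '
    NR > 1 && NF >= 5 {
      split($4, cp, "/")       # CONTROLPLANE, e.g. 3/3
      split($5, workers, "/")  # WORKERS, e.g. 4/4
      if ($3 != "running" || cp[1] != cp[2] || workers[1] != workers[2]) print $1
    }
  '
}

# Example usage (not executed here):
# tanzu cluster list --include-management-cluster | clusters_not_ready
```

For the listing above, this prints nothing, since both clusters are running with all replicas ready.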

And the tanzu cluster get command will give more details:

tanzu cluster get tkg-wld

  NAME     NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES
  tkg-wld  default    running  3/3           4/4      v1.21.2+vmware.1  <none>
ℹ

Details:

NAME                                                        READY  SEVERITY  REASON  SINCE  MESSAGE                           
/tkg-wld                                                    True                     6m43s                                    
├─ClusterInfrastructure - VSphereCluster/tkg-wld            True                     52m
├─ControlPlane - KubeadmControlPlane/tkg-wld-control-plane  True                     6m44s
│ └─3 Machines...                                           True                     48m    See tkg-wld-control-plane-cxh9l, tkg-wld-control-plane-lhw6j, ...
└─Workers
  ├─MachineDeployment/tkg-wld-md-0
  │ └─2 Machines...                                         True                     36m    See tkg-wld-md-0-847b7d8cf9-26qff, tkg-wld-md-0-847b7d8cf9-pksnq
  └─MachineDeployment/tkg-wld-md-1
    └─2 Machines...                                         True                     41m    See tkg-wld-md-1-85cc664768-llqkt, tkg-wld-md-1-85cc664768-zpzjr

Inspecting the configuration of the objects that were configured to be in specific availability zones is not terribly difficult:

kubectl get machinedeployment  -o=custom-columns=NAME:.metadata.name,FAILUREDOMAIN:.spec.template.spec.failureDomain

NAME           FAILUREDOMAIN
tkg-wld-md-0   az1
tkg-wld-md-1   az2

kubectl get machine -o=custom-columns=NAME:.metadata.name,FAILUREDOMAIN:.spec.failureDomain

NAME                            FAILUREDOMAIN
tkg-wld-control-plane-cxh9l     az1
tkg-wld-control-plane-lhw6j     az2
tkg-wld-control-plane-rb2nj     az2
tkg-wld-md-0-847b7d8cf9-26qff   az1
tkg-wld-md-0-847b7d8cf9-pksnq   az1
tkg-wld-md-1-85cc664768-llqkt   az2
tkg-wld-md-1-85cc664768-zpzjr   az2
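
With a larger cluster, a per-zone tally is easier to read than the full list. Here is a minimal sketch that counts machines per failure domain from the two-column output above; machines_per_failure_domain is a hypothetical helper name, and it assumes a single header row followed by NAME and FAILUREDOMAIN columns.

```shell
# machines_per_failure_domain: reads the two-column custom-columns output
# (NAME, FAILUREDOMAIN) on stdin and prints "<failure domain> <count>" per zone.
# Hypothetical helper; the first line is treated as the header and skipped.
machines_per_failure_domain() {
  awk '
    NR > 1 && NF >= 2 { count[$2]++ }
    END { for (fd in count) print fd, count[fd] }
  ' | sort
}

# Example usage from the management cluster context (not executed here):
# kubectl get machine \
#   -o=custom-columns=NAME:.metadata.name,FAILUREDOMAIN:.spec.failureDomain \
#   | machines_per_failure_domain
```

For the seven machines listed above, the tally is three in az1 and four in az2.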

The worker machines are owned by a MachineSet, which is in turn owned by the MachineDeployment that was configured earlier (via the overlay file). The control plane VMs were not configured this way but still ended up in a failure domain. There is a controller running that checks whether the spec.controlPlane value in the deployment zone definition is set to true and updates the control plane machines to set their spec.failureDomain value appropriately. You can also check at the cluster level to validate that the different availability zones are configured for the control plane nodes:

kubectl get cluster tkg-wld -o json | jq .status.failureDomains
{
  "az1": {
    "controlPlane": true
  },
  "az2": {
    "controlPlane": true
  }
}

After tracing the ownership back to the cluster itself, you can see that both failure domains are enabled for the control plane nodes at the cluster level.

kubectl get vspheremachinetemplate  -o=custom-columns=NAME:.metadata.name,FAILUREDOMAIN:.spec.template.spec.failureDomain

NAME                    FAILUREDOMAIN
tkg-wld-control-plane   <none>
tkg-wld-worker-0        az1
tkg-wld-worker-1        az2

kubectl get vspheremachine -o=custom-columns=NAME:.metadata.name,FAILUREDOMAIN:.spec.failureDomain

NAME                          FAILUREDOMAIN
tkg-wld-control-plane-ck4rm   <none>
tkg-wld-control-plane-fthps   <none>
tkg-wld-control-plane-xv9dh   <none>
tkg-wld-worker-0-265bz        az1
tkg-wld-worker-0-rp2mw        az1
tkg-wld-worker-1-7dmbv        az2
tkg-wld-worker-1-8g66l        az2

It is expected that the VSphereMachine and VSphereMachineTemplate objects only show failure domain values for worker objects since we only configured worker objects in the overlay file. As noted earlier, the control plane machine objects are managed by the KubeadmControlPlane (and subsequently the cluster) object.

You can also see that the nodes have been spread across appropriate VM groups in the vSphere Client:

3 thoughts on “Using multiple availability zones for a workload cluster in TKG 1.4 on vSphere”

  1. Chris,
    Excellent post. I think your ytt is wrong:

    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: VSphereMachineTemplate
      name: #@ "{}-md-1".format(data.values.CLUSTER_NAME)
    version: #@ data.values.KUBERNETES_VERSION

    Should in fact be

    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: VSphereMachineTemplate
      name: #@ "{}-worker-1".format(data.values.CLUSTER_NAME)
    version: #@ data.values.KUBERNETES_VERSION

    I made this change and my cluster deployed. With -md-1, any nodes in the second zone do not deploy, as there is no vspheremachinetemplate with the name …..-md-1.

    1. Thank you so much for catching this! I had pasted in a not-quite-complete version of the overlay file before I had addressed that naming issue in the final version. I have updated the post to reflect the correct name of the second vspheremachinetemplate in the second machinedeployment spec section.
