Upgrading TKG clusters is a fairly simple process, as seen in my previous post, How to Upgrade a Tanzu Kubernetes Grid cluster from 1.0 to 1.1. I decided to go through the process a second time, going from the 1.1.3 version to the 1.2.1 version. The process was just as simple, with the exception of the migration from HAProxy to Kube-VIP. This added a few extra steps but nothing overly burdensome.
We start out by identifying the management and workload clusters present in the environment. In my current lab, I've got two clusters running TKG 1.1.3 with Kubernetes 1.18.6.
tkg version
Client:
Version: v1.1.3
tkg get cluster --include-management-cluster
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES
tkg-113-wld default running 1/1 1/1 v1.18.6+vmware.1
tkg-113-mgmt tkg-system running 1/1 1/1 v1.18.6+vmware.1
kubectl get cluster -A --show-labels
NAMESPACE NAME PHASE LABELS
default tkg-113-wld Provisioned
tkg-system tkg-113-mgmt Provisioned
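All of the kubectl commands in this walkthrough are run against the management cluster's admin context (the Cluster API objects for both clusters live there). If you need to switch to it first, something like this works; the context name matches my lab:
kubectl config use-context tkg-113-mgmt-admin@tkg-113-mgmt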
The two main things to do, as noted in How to Upgrade a Tanzu Kubernetes Grid cluster from 1.0 to 1.1, are to download the updated tkg CLI binary and the latest Kubernetes node OS OVA file. There is no need for an HAProxy OVA this time since we're moving to Kube-VIP. With the new CLI in place and the new OVA saved as a template in vCenter, we can run our first tkg command with the new binary to update the relevant TKG configuration files.
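As an aside, if you'd rather script the OVA import than click through the vSphere client, a govc sketch along these lines works; the connection details, datastore, folder, and OVA filename below are placeholders for my lab:
export GOVC_URL='vcenter.lab.local' GOVC_USERNAME='administrator@vsphere.local' GOVC_PASSWORD='changeme' GOVC_INSECURE=1
# import the 1.19.3 node OS OVA and mark the resulting VM as a template
govc import.ova -ds=datastore1 -folder=/Lab/vm -name=photon-3-kube-v1.19.3 ./photon-3-kube-v1.19.3.ova
govc vm.markastemplate photon-3-kube-v1.19.3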
tkg get mc
It seems that the TKG settings on this system are out-of-date. Proceeding on this command will cause them to be backed up and overwritten by the latest settings.
Do you want to continue? [y/N]: y
the old providers folder /home/ubuntu/.tkg/providers is backed up to /home/ubuntu/.tkg/providers-20201211121325-ppjuhf4h
The old bom folder /home/ubuntu/.tkg/bom is backed up to /home/ubuntu/.tkg/bom-20201211121325-6741cksu
MANAGEMENT-CLUSTER-NAME CONTEXT-NAME STATUS
tkg-113-mgmt * tkg-113-mgmt-admin@tkg-113-mgmt Success
tkg version
Client:
Version: v1.2.1
Before moving ahead with the upgrade, we need to label the management cluster as a management cluster (something not done in the 1.1 version).
kubectl -n tkg-system label cluster tkg-113-mgmt cluster-role.tkg.tanzu.vmware.com/management="" --overwrite=true
cluster.cluster.x-k8s.io/tkg-113-mgmt labeled
kubectl get cluster -A --show-labels
NAMESPACE NAME PHASE LABELS
default tkg-113-wld Provisioned
tkg-system tkg-113-mgmt Provisioned cluster-role.tkg.tanzu.vmware.com/management=
With that out of the way, we’re ready to kick off the upgrade of the management cluster.
tkg upgrade management-cluster tkg-113-mgmt
Logs of the command execution can also be found at: /tmp/tkg-20201211T121846451369822.log
Upgrading management cluster 'tkg-113-mgmt' to TKG version 'v1.2.1' with Kubernetes version 'v1.19.3+vmware.1'. Are you sure? [y/N]: y
Upgrading management cluster providers…
Checking cert-manager version…
Deleting cert-manager Version="v0.11.0"
Installing cert-manager Version="v0.16.1"
Waiting for cert-manager to be available…
Performing upgrade…
Deleting Provider="cluster-api" Version="" TargetNamespace="capi-system"
Installing Provider="cluster-api" Version="v0.3.11" TargetNamespace="capi-system"
Deleting Provider="bootstrap-kubeadm" Version="" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.11" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.11" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="infrastructure-vsphere" Version="" TargetNamespace="capv-system"
Installing Provider="infrastructure-vsphere" Version="v0.7.1" TargetNamespace="capv-system"
Management cluster providers upgraded successfully…
Upgrading management cluster kubernetes version…
Verifying kubernetes version…
Retrieving configuration for upgrade cluster…
consuming Azure VM image information from BOM
Create InfrastructureTemplate for upgrade…
Upgrading control plane nodes…
Patching KubeadmControlPlane with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for control plane nodes
At this point, we’ll finally see some activity in vSphere. A new control plane VM is created from the 1.19.3 Kubernetes node OS image that has been saved as a template.
And just a few minutes later we can see the original control plane node for the management cluster getting deleted.
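If you'd rather follow the rollover from the CLI than from the vSphere client, watching the Cluster API Machine objects in the management cluster shows the same thing:
kubectl -n tkg-system get machines -w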
With the control plane node done (I only had one), the upgrade process moves on to the worker node.
Upgrading worker nodes…
Patching MachineDeployment with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for worker nodes…
And just as with the control plane, we'll see new worker nodes created from the 1.19.3 image (only one again in my case) and the originals deleted.
And within a few minutes, the process is completed.
Management cluster 'tkg-113-mgmt' successfully upgraded to TKG version 'v1.2.1' with kubernetes version 'v1.19.3+vmware.1'
tkg get cluster --include-management-cluster
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES
tkg-113-wld default running 1/1 1/1 v1.18.6+vmware.1 <none>
tkg-113-mgmt tkg-system running 1/1 1/1 v1.19.3+vmware.1 management
Upgrading the workload cluster follows the same process.
tkg upgrade cluster tkg-113-wld
Logs of the command execution can also be found at: /tmp/tkg-20201211T132203363915741.log
Upgrading workload cluster 'tkg-113-wld' to kubernetes version 'v1.19.3+vmware.1'. Are you sure? [y/N]: y
Validating configuration…
Verifying kubernetes version…
Retrieving configuration for upgrade cluster…
consuming Azure VM image information from BOM
Create InfrastructureTemplate for upgrade…
Upgrading control plane nodes…
Patching KubeadmControlPlane with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for control plane nodes
Upgrading worker nodes…
Patching MachineDeployment with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for worker nodes…
Cluster 'tkg-113-wld' successfully upgraded to kubernetes version 'v1.19.3+vmware.1'
We can now see that both clusters are running the newer Kubernetes version.
tkg get cluster --include-management-cluster
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES
tkg-113-wld default running 1/1 1/1 v1.19.3+vmware.1 <none>
tkg-113-mgmt tkg-system running 1/1 1/1 v1.19.3+vmware.1 management
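As an optional sanity check, we can also grab the workload cluster's kubeconfig and confirm the node versions directly (remember to switch back to the management cluster context afterwards, since the remaining steps run there):
tkg get credentials tkg-113-wld
kubectl config use-context tkg-113-wld-admin@tkg-113-wld
kubectl get nodes -o wide
kubectl config use-context tkg-113-mgmt-admin@tkg-113-mgmt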
While the management and workload clusters are both functional, we don’t want to leave them using HAProxy for load balancing the API endpoint IP address. Kube-VIP handles that now and the original HAProxy VMs can be removed once the functionality is migrated.
The first thing to do is to validate the IP address in use for the management cluster. An object of type haproxyloadbalancer should exist and we just need to get the IP address associated with it.
kubectl -n tkg-system get haproxyloadbalancer
NAME AGE
tkg-113-mgmt-tkg-system 3h42m
kubectl -n tkg-system get haproxyloadbalancer tkg-113-mgmt-tkg-system -o template='{{.status.address}}'
192.168.100.100
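The same address should also show up as the cluster's control plane endpoint, which makes for a quick cross-check (controlPlaneEndpoint is a standard field in the Cluster API Cluster spec):
kubectl -n tkg-system get cluster tkg-113-mgmt -o jsonpath='{.spec.controlPlaneEndpoint.host}'
# should print the same 192.168.100.100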
With the IP address validated, we can now prepare to update the kubeadm configuration to make use of Kube-VIP. We'll create a small yaml file with the kube-vip info we need, plugging in the IP address that was used by HAProxy as the vip_address value.
cat patch.yaml
spec:
  kubeadmConfigSpec:
    files:
    - content: |
        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: kube-vip
          namespace: kube-system
        spec:
          containers:
          - args:
            - start
            env:
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "true"
            - name: vip_address
              value: 192.168.100.100
            - name: vip_interface
              value: eth0
            - name: vip_leaseduration
              value: "15"
            - name: vip_renewdeadline
              value: "10"
            - name: vip_retryperiod
              value: "2"
            image: registry.tkg.vmware.run/kube-vip:v0.1.8_vmware.1
            imagePullPolicy: IfNotPresent
            name: kube-vip
            resources: {}
            securityContext:
              capabilities:
                add:
                - NET_ADMIN
                - SYS_TIME
            volumeMounts:
            - mountPath: /etc/kubernetes/admin.conf
              name: kubeconfig
          hostNetwork: true
          volumes:
          - hostPath:
              path: /etc/kubernetes/admin.conf
              type: FileOrCreate
            name: kubeconfig
        status: {}
      owner: root:root
      path: /etc/kubernetes/manifests/kube-vip.yaml
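Since the same patch is needed again shortly for the workload cluster with a different VIP, it can be handy to template it rather than hand-edit the IP. A minimal sketch, assuming a copy of the file above is saved as kube-vip-patch.tmpl with the vip_address value replaced by the token VIP_ADDRESS:
VIP=192.168.100.100
sed "s/VIP_ADDRESS/${VIP}/" kube-vip-patch.tmpl > patch.yaml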
We need to get the name of the management cluster's KubeadmControlPlane (kcp) object so we can patch it.
kubectl -n tkg-system get kcp
NAME INITIALIZED API SERVER AVAILABLE VERSION REPLICAS READY UPDATED UNAVAILABLE
tkg-113-mgmt-control-plane true true v1.19.3+vmware.1 1 1 1
And now we're ready to patch the tkg-113-mgmt-control-plane kcp object.
kubectl -n tkg-system patch kcp tkg-113-mgmt-control-plane --type merge --patch "$(cat patch.yaml)"
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/tkg-113-mgmt-control-plane patched
This will reconcile in the background but you’ll see a new control plane node getting created in the vSphere client and the original one getting removed. If you have multiple control plane nodes, the process will roll through them all one at a time.
I happened to check on the nodes in the cluster just as the original control plane node was about to be deleted.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
tkg-113-mgmt-control-plane-98rvc Ready,SchedulingDisabled master 158m v1.19.3+vmware.1
tkg-113-mgmt-control-plane-k578g Ready master 6m23s v1.19.3+vmware.1
tkg-113-mgmt-md-0-68d475c787-8gdnt Ready <none> 145m v1.19.3+vmware.1
Once the process is done, we can check on the most recently created control plane node and make sure that it’s functional.
kubectl get ma | grep $(kubectl get ma --sort-by=.metadata.creationTimestamp -o jsonpath="{.items[-1:].metadata.name}")
tkg-113-wld-md-0-7dfd997c99-b7mlm vsphere://422af54b-8b56-ae81-8139-0c8d5c41948b Running v1.19.3+vmware.1
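A couple of other quick checks are confirming that the kube-vip static pod is running on the new control plane node and that the VIP still answers on the API server port:
kubectl -n kube-system get pods | grep kube-vip
# unauthenticated requests to /version are allowed by the kubeadm defaults, so this should return version JSON
curl -k https://192.168.100.100:6443/version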
And now we can remove the haproxyloadbalancer object.
kubectl -n tkg-system delete haproxyloadbalancer tkg-113-mgmt-tkg-system
haproxyloadbalancer.infrastructure.cluster.x-k8s.io "tkg-113-mgmt-tkg-system" deleted
And the HAProxy VM is removed from vSphere.
We need to edit the vspherecluster object to remove the reference to HAProxy.
kubectl -n tkg-system edit vspherecluster tkg-113-mgmt
Remove the following stanza:
loadBalancerRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: HAProxyLoadBalancer
name: tkg-113-mgmt-tkg-system
vspherecluster.infrastructure.cluster.x-k8s.io/tkg-113-mgmt edited
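If you'd rather not edit interactively, a JSON patch should remove the same stanza (a sketch I haven't run; the interactive edit is what I actually did):
kubectl -n tkg-system patch vspherecluster tkg-113-mgmt --type=json -p='[{"op": "remove", "path": "/spec/loadBalancerRef"}]'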
Lastly, since the IP address for the HAProxy VM was assigned via DHCP, we need to create a static DHCP reservation for the Kube-VIP address, using a bogus MAC address that will never come into use, and modify the DHCP range to exclude this IP address. This ensures that the DHCP server never reassigns the Kube-VIP address to some other machine. In my environment, I'm running a simple Linux DHCP server and only need to create a stanza similar to the following:
host mgmt-kube-vip {
hardware ethernet 00:11:22:33:44:55;
fixed-address 192.168.100.100;
}
And my IP range is modified from:
range 192.168.100.100 192.168.100.250;
To:
range 192.168.100.101 192.168.100.250;
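After editing dhcpd.conf, restart the DHCP service so the reservation and the new range take effect (the service name here assumes the ISC DHCP server on Ubuntu):
sudo systemctl restart isc-dhcp-server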
The process for moving from HAProxy to Kube-VIP for the workload cluster is identical to what was done for the management cluster and is also performed from the context of the management cluster; the largest difference is that the objects for the workload cluster live in the default namespace.
kubectl get haproxyloadbalancer
NAME AGE
tkg-113-wld-default 4h5m
kubectl get haproxyloadbalancer tkg-113-wld-default -o template='{{.status.address}}'
192.168.100.103
cat patch.yaml
spec:
  kubeadmConfigSpec:
    files:
    - content: |
        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: kube-vip
          namespace: kube-system
        spec:
          containers:
          - args:
            - start
            env:
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "true"
            - name: vip_address
              value: 192.168.100.103
            - name: vip_interface
              value: eth0
            - name: vip_leaseduration
              value: "15"
            - name: vip_renewdeadline
              value: "10"
            - name: vip_retryperiod
              value: "2"
            image: registry.tkg.vmware.run/kube-vip:v0.1.8_vmware.1
            imagePullPolicy: IfNotPresent
            name: kube-vip
            resources: {}
            securityContext:
              capabilities:
                add:
                - NET_ADMIN
                - SYS_TIME
            volumeMounts:
            - mountPath: /etc/kubernetes/admin.conf
              name: kubeconfig
          hostNetwork: true
          volumes:
          - hostPath:
              path: /etc/kubernetes/admin.conf
              type: FileOrCreate
            name: kubeconfig
        status: {}
      owner: root:root
      path: /etc/kubernetes/manifests/kube-vip.yaml
kubectl get kcp
NAME INITIALIZED API SERVER AVAILABLE VERSION REPLICAS READY UPDATED UNAVAILABLE
tkg-113-wld-control-plane true true v1.19.3+vmware.1 1 1 1
kubectl patch kcp tkg-113-wld-control-plane --type merge --patch "$(cat patch.yaml)"
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/tkg-113-wld-control-plane patched
kubectl get ma | grep $(kubectl get ma --sort-by=.metadata.creationTimestamp -o jsonpath="{.items[-1:].metadata.name}")
tkg-113-wld-control-plane-c7m6z vsphere://422aeb95-8657-3557-a77f-6bc25bb0f075 Running v1.19.3+vmware.1
kubectl delete haproxyloadbalancer tkg-113-wld-default
haproxyloadbalancer.infrastructure.cluster.x-k8s.io "tkg-113-wld-default" deleted
kubectl edit vspherecluster tkg-113-wld
Remove the following stanza:
loadBalancerRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: HAProxyLoadBalancer
name: tkg-113-wld-default
vspherecluster.infrastructure.cluster.x-k8s.io/tkg-113-wld edited
And another static DHCP reservation needs to be created for the workload cluster’s Kube-VIP address.
host wld-kube-vip {
hardware ethernet 00:11:22:33:44:56;
fixed-address 192.168.100.103;
}
My DHCP scope now looks like the following (excluding both the .100 and .103 addresses):
range 192.168.100.101 192.168.100.102;
range 192.168.100.104 192.168.100.250;
And both of my clusters are now upgraded to the 1.2.1 version running Kubernetes 1.19.3.
I ran into an interesting issue when I went to create a new 1.2.1 cluster.
tkg create cluster tkg-121-wld2 -p dev --vsphere-controlplane-endpoint=192.168.100.200
Logs of the command execution can also be found at: /tmp/tkg-20201212T063545219579220.log
Validating configuration…
Error: : workload cluster configuration validation failed: vSphere config validation failed: vSphere node size validation failed: the minimum requirement of VSPHERE_CONTROL_PLANE_NUM_CPUS is 2
Detailed log about the failure can be found at: /tmp/tkg-20201212T063545219579220.log
When I stood up the 1.1.3 environment, I had explicitly set the number of CPUs assigned to the control plane and worker nodes to "1". While this was acceptable for TKG 1.1, two is the minimum in TKG 1.2.1. These values can be set as environment variables or in the .tkg/config.yaml file, and getting around this was as simple as removing the values (or updating them to "2" or higher). If you run into an error similar to this one, check your .tkg/config.yaml file or environment for values like the following and remove or update them.
VSPHERE_WORKER_NUM_CPUS: "1"
VSPHERE_CONTROL_PLANE_NUM_CPUS: "1"
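For example, something along these lines finds and bumps the offending values in the config file (a sed sketch; deleting the lines works just as well):
grep NUM_CPUS ~/.tkg/config.yaml
sed -i 's/^\(VSPHERE_WORKER_NUM_CPUS:\).*/\1 "2"/' ~/.tkg/config.yaml
sed -i 's/^\(VSPHERE_CONTROL_PLANE_NUM_CPUS:\).*/\1 "2"/' ~/.tkg/config.yaml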