Upgrading from TKG 1.1 to 1.2

Upgrading TKG clusters is a fairly simple process, as seen in my previous post, How to Upgrade a Tanzu Kubernetes Grid cluster from 1.0 to 1.1. I decided to go through the process a second time, this time from version 1.1.3 to 1.2.1. It was just as simple, with the exception of the migration from HAProxy to Kube-VIP, which added a few extra steps but nothing overly burdensome.

We start out by identifying the management and workload clusters present in the environment. In my current lab, I've got two clusters on TKG 1.1.3, both running Kubernetes 1.18.6.

tkg version

Client:
        Version: v1.1.3
tkg get cluster --include-management-cluster

  NAME          NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES
  tkg-113-wld   default     running  1/1           1/1      v1.18.6+vmware.1
  tkg-113-mgmt  tkg-system  running  1/1           1/1      v1.18.6+vmware.1
kubectl get cluster -A --show-labels

 NAMESPACE    NAME           PHASE         LABELS
 default      tkg-113-wld    Provisioned   
 tkg-system   tkg-113-mgmt   Provisioned   

The two main things to do, as noted in How to Upgrade a Tanzu Kubernetes Grid cluster from 1.0 to 1.1, are to download the updated tkg CLI binary and the latest Kubernetes node OS OVA file. There is no need for an HAProxy OVA this time since we're moving to Kube-VIP. With the new CLI in place and the new OVA saved as a template in vCenter, we can run our first tkg command with the new binary to update the relevant TKG files.
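If you prefer to handle the template import from the command line, govc can do it in two steps. This is only a sketch, assuming govc is already pointed at your vCenter (GOVC_URL, credentials, datastore, and so on) and the OVA filename below is a placeholder for whatever the 1.2.1 release actually ships:

govc import.ova -name photon-3-kube-v1.19.3 ./photon-3-kube-v1.19.3+vmware.1.ova
govc vm.markastemplate photon-3-kube-v1.19.3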

tkg get mc

 It seems that the TKG settings on this system are out-of-date. Proceeding on this command will cause them to be backed up and overwritten by the latest settings.
 Do you want to continue? [y/N]: y
 the old providers folder /home/ubuntu/.tkg/providers is backed up to /home/ubuntu/.tkg/providers-20201211121325-ppjuhf4h
 The old bom folder /home/ubuntu/.tkg/bom is backed up to /home/ubuntu/.tkg/bom-20201211121325-6741cksu
  MANAGEMENT-CLUSTER-NAME  CONTEXT-NAME                     STATUS
  tkg-113-mgmt *           tkg-113-mgmt-admin@tkg-113-mgmt  Success
tkg version
Client:
        Version: v1.2.1

Before moving ahead with the upgrade, we need to label the management cluster as a management cluster (something not done in the 1.1 version).

kubectl -n tkg-system label cluster tkg-113-mgmt cluster-role.tkg.tanzu.vmware.com/management="" --overwrite=true

cluster.cluster.x-k8s.io/tkg-113-mgmt labeled

kubectl get cluster -A --show-labels

 NAMESPACE    NAME           PHASE         LABELS
 default      tkg-113-wld    Provisioned   
 tkg-system   tkg-113-mgmt   Provisioned   cluster-role.tkg.tanzu.vmware.com/management=

With that out of the way, we’re ready to kick off the upgrade of the management cluster.

tkg upgrade management-cluster tkg-113-mgmt

Logs of the command execution can also be found at: /tmp/tkg-20201211T121846451369822.log
Upgrading management cluster 'tkg-113-mgmt' to TKG version 'v1.2.1' with Kubernetes version 'v1.19.3+vmware.1'. Are you sure? [y/N]: y
Upgrading management cluster providers…
Checking cert-manager version…
Deleting cert-manager Version="v0.11.0"
Installing cert-manager Version="v0.16.1"
Waiting for cert-manager to be available…
Performing upgrade…
Deleting Provider="cluster-api" Version="" TargetNamespace="capi-system"
Installing Provider="cluster-api" Version="v0.3.11" TargetNamespace="capi-system"
Deleting Provider="bootstrap-kubeadm" Version="" TargetNamespace="capi-kubeadm-bootstrap-system"
Installing Provider="bootstrap-kubeadm" Version="v0.3.11" TargetNamespace="capi-kubeadm-bootstrap-system"
Deleting Provider="control-plane-kubeadm" Version="" TargetNamespace="capi-kubeadm-control-plane-system"
Installing Provider="control-plane-kubeadm" Version="v0.3.11" TargetNamespace="capi-kubeadm-control-plane-system"
Deleting Provider="infrastructure-vsphere" Version="" TargetNamespace="capv-system"
Installing Provider="infrastructure-vsphere" Version="v0.7.1" TargetNamespace="capv-system"
Management cluster providers upgraded successfully…
Upgrading management cluster kubernetes version…
Verifying kubernetes version…
Retrieving configuration for upgrade cluster…
consuming Azure VM image information from BOM
Create InfrastructureTemplate for upgrade…
Upgrading control plane nodes…
Patching KubeadmControlPlane with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for control plane nodes

At this point, we’ll finally see some activity in vSphere. A new control plane VM is created from the 1.19.3 Kubernetes node OS image that has been saved as a template.

And just a few minutes later we can see the original control plane node for the management cluster getting deleted.
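If you'd rather follow the rollout from a shell than from the vSphere client, a govc query along these lines will show the management cluster VMs appearing and disappearing as they're replaced (a sketch, assuming govc is configured for your vCenter and the clusters use the default VM naming):

govc find / -type m -name 'tkg-113-mgmt-*'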

With the control plane node done (I only had one), the upgrade process moves on to the worker node.

Upgrading worker nodes…
Patching MachineDeployment with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for worker nodes…

And just as with the control plane, we'll see new worker nodes created from the 1.19.3 image (again, only one in my case) and the originals deleted.

And within a few minutes, the process is completed.

Management cluster 'tkg-113-mgmt' successfully upgraded to TKG version 'v1.2.1' with kubernetes version 'v1.19.3+vmware.1'
tkg get cluster --include-management-cluster

 NAME          NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES
 tkg-113-wld   default     running  1/1           1/1      v1.18.6+vmware.1  
 tkg-113-mgmt  tkg-system  running  1/1           1/1      v1.19.3+vmware.1  management

Upgrading the workload cluster follows the same process.

tkg upgrade cluster tkg-113-wld

Logs of the command execution can also be found at: /tmp/tkg-20201211T132203363915741.log
Upgrading workload cluster 'tkg-113-wld' to kubernetes version 'v1.19.3+vmware.1'. Are you sure? [y/N]: y
Validating configuration…
Verifying kubernetes version…
Retrieving configuration for upgrade cluster…
consuming Azure VM image information from BOM
Create InfrastructureTemplate for upgrade…
Upgrading control plane nodes…
Patching KubeadmControlPlane with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for control plane nodes
Upgrading worker nodes…
Patching MachineDeployment with the kubernetes version v1.19.3+vmware.1…
Waiting for kubernetes version to be updated for worker nodes…
Cluster 'tkg-113-wld' successfully upgraded to kubernetes version 'v1.19.3+vmware.1'
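If you also want to confirm the node versions from inside the workload cluster itself, pulling its kubeconfig and listing the nodes works; a quick sketch (remember to switch back to the management cluster context afterwards, since the HAProxy migration steps below run there):

tkg get credentials tkg-113-wld
kubectl config use-context tkg-113-wld-admin@tkg-113-wld
kubectl get nodes
kubectl config use-context tkg-113-mgmt-admin@tkg-113-mgmt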

We can now see that both clusters are running the newer Kubernetes version.

tkg get cluster --include-management-cluster

 NAME          NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES
 tkg-113-wld   default     running  1/1           1/1      v1.19.3+vmware.1  
 tkg-113-mgmt  tkg-system  running  1/1           1/1      v1.19.3+vmware.1  management

While the management and workload clusters are both functional, we don't want to leave them relying on HAProxy to provide the API endpoint IP address. Kube-VIP handles that in TKG 1.2, and the original HAProxy VMs can be removed once the functionality has been migrated.

The first thing to do is to validate the IP address in use for the management cluster. An object of type haproxyloadbalancer should exist and we just need to get the IP address associated with it.

kubectl -n tkg-system get haproxyloadbalancer

NAME                      AGE
tkg-113-mgmt-tkg-system   3h42m
kubectl -n tkg-system get haproxyloadbalancer tkg-113-mgmt-tkg-system -o template='{{.status.address}}'

192.168.100.100

With the IP address validated, we can now prepare to update the kubeadm configuration to make use of Kube-VIP. We'll create a small yaml file with the kube-vip info we need, using the IP address that was previously assigned to HAProxy as the vip_address value.

cat patch.yaml

spec:
  kubeadmConfigSpec:
    files:
    - content: |
        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: kube-vip
          namespace: kube-system
        spec:
          containers:
          - args:
            - start
            env:
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "true"
            - name: vip_address
              value: 192.168.100.100
            - name: vip_interface
              value: eth0
            - name: vip_leaseduration
              value: "15"
            - name: vip_renewdeadline
              value: "10"
            - name: vip_retryperiod
              value: "2"
            image: registry.tkg.vmware.run/kube-vip:v0.1.8_vmware.1
            imagePullPolicy: IfNotPresent
            name: kube-vip
            resources: {}
            securityContext:
              capabilities:
                add:
                - NET_ADMIN
                - SYS_TIME
            volumeMounts:
            - mountPath: /etc/kubernetes/admin.conf
              name: kubeconfig
          hostNetwork: true
          volumes:
          - hostPath:
              path: /etc/kubernetes/admin.conf
              type: FileOrCreate
            name: kubeconfig
        status: {}
      owner: root:root
      path: /etc/kubernetes/manifests/kube-vip.yaml

We need to validate the control plane name for the management cluster so we can patch it.

kubectl -n tkg-system get kcp

NAME                         INITIALIZED   API SERVER AVAILABLE   VERSION            REPLICAS   READY   UPDATED   UNAVAILABLE
tkg-113-mgmt-control-plane   true          true                   v1.19.3+vmware.1   1          1       1

And now we’re ready to patch the tkg-113-mgmt-control-plane kcp object.

kubectl -n tkg-system patch kcp tkg-113-mgmt-control-plane --type merge --patch "$(cat patch.yaml)"

kubeadmcontrolplane.controlplane.cluster.x-k8s.io/tkg-113-mgmt-control-plane patched

This will reconcile in the background but you’ll see a new control plane node getting created in the vSphere client and the original one getting removed. If you have multiple control plane nodes, the process will roll through them all one at a time.
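If you'd like to watch the rollout from kubectl instead of the vSphere client, watching the machine objects in the management cluster's namespace will show the new node coming up and the old one going away (a simple sketch, run from the management cluster context):

kubectl -n tkg-system get ma -w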

I happened to check on the nodes in the cluster just as the original control plane node was about to be deleted.

kubectl get nodes

NAME                                 STATUS                     ROLES    AGE     VERSION
tkg-113-mgmt-control-plane-98rvc     Ready,SchedulingDisabled   master   158m    v1.19.3+vmware.1
tkg-113-mgmt-control-plane-k578g     Ready                      master   6m23s   v1.19.3+vmware.1
tkg-113-mgmt-md-0-68d475c787-8gdnt   Ready                               145m    v1.19.3+vmware.1

Once the process is done, we can check on the most recently created control plane node and make sure that it’s functional.

kubectl get ma | grep $(kubectl get ma --sort-by=.metadata.creationTimestamp -o jsonpath="{.items[-1:].metadata.name}")

tkg-113-wld-md-0-7dfd997c99-b7mlm   vsphere://422af54b-8b56-ae81-8139-0c8d5c41948b   Running   v1.19.3+vmware.1
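Before touching HAProxy, it's also worth confirming that the kube-vip static pod is actually running. Since the patch drops a static pod manifest into /etc/kubernetes/manifests, kubelet should surface it in kube-system with the node name appended to the pod name; a quick check, assuming the management cluster context:

kubectl -n kube-system get pods | grep kube-vip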

And now we can remove the haproxyloadbalancer object.

kubectl -n tkg-system delete haproxyloadbalancer tkg-113-mgmt-tkg-system

haproxyloadbalancer.infrastructure.cluster.x-k8s.io "tkg-113-mgmt-tkg-system" deleted

And the HAProxy VM is removed from vSphere.

We need to edit the vspherecluster object to remove the reference to HAProxy.

kubectl -n tkg-system edit vspherecluster tkg-113-mgmt

Remove the following stanza:

  loadBalancerRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: HAProxyLoadBalancer
    name: tkg-113-mgmt-tkg-system
vspherecluster.infrastructure.cluster.x-k8s.io/tkg-113-mgmt edited
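If you'd rather avoid an interactive edit, a JSON patch should accomplish the same removal. This is a sketch, assuming loadBalancerRef sits directly under spec in the vspherecluster object (verify with kubectl -n tkg-system get vspherecluster tkg-113-mgmt -o yaml first):

kubectl -n tkg-system patch vspherecluster tkg-113-mgmt --type json -p '[{"op": "remove", "path": "/spec/loadBalancerRef"}]'

The same approach works later for the workload cluster, minus the namespace flag.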

Lastly, since the IP address for the HAProxy VM was assigned via DHCP, we need to create a static DHCP reservation for the Kube-VIP address (using a bogus MAC address that will never actually show up on the network) and modify the DHCP range to exclude this IP. This ensures the DHCP server never hands the Kube-VIP address out to some other machine. In my environment I'm running a simple Linux DHCP server, so I only need to add a stanza similar to the following:

host mgmt-kube-vip {
       hardware ethernet 00:11:22:33:44:55;
       fixed-address 192.168.100.100;
}

And my IP range is modified from:

range 192.168.100.100 192.168.100.250;

To:

range 192.168.100.101 192.168.100.250;
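Keep in mind that the reservation and range changes only take effect once the DHCP service rereads its configuration; on an ISC DHCP server that typically means restarting the service (the service name below is the Debian/Ubuntu one, adjust for your distribution):

sudo systemctl restart isc-dhcp-server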

The process for moving from HAProxy to Kube-VIP for the workload cluster is identical to what was done for the management cluster, and it is also performed from the context of the management cluster. The largest difference is that the objects for the workload cluster live in the default namespace.

kubectl get haproxyloadbalancer

NAME                      AGE
tkg-113-wld-default       4h05m
kubectl get haproxyloadbalancer tkg-113-wld-default -o template='{{.status.address}}'              

192.168.100.103
cat patch.yaml

spec:
  kubeadmConfigSpec:
    files:
    - content: |
        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: kube-vip
          namespace: kube-system
        spec:
          containers:
          - args:
            - start
            env:
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "true"
            - name: vip_address
              value: 192.168.100.103
            - name: vip_interface
              value: eth0
            - name: vip_leaseduration
              value: "15"
            - name: vip_renewdeadline
              value: "10"
            - name: vip_retryperiod
              value: "2"
            image: registry.tkg.vmware.run/kube-vip:v0.1.8_vmware.1
            imagePullPolicy: IfNotPresent
            name: kube-vip
            resources: {}
            securityContext:
              capabilities:
                add:
                - NET_ADMIN
                - SYS_TIME
            volumeMounts:
            - mountPath: /etc/kubernetes/admin.conf
              name: kubeconfig
          hostNetwork: true
          volumes:
          - hostPath:
              path: /etc/kubernetes/admin.conf
              type: FileOrCreate
            name: kubeconfig
        status: {}
      owner: root:root
      path: /etc/kubernetes/manifests/kube-vip.yaml
kubectl get kcp

NAME                         INITIALIZED   API SERVER AVAILABLE   VERSION            REPLICAS   READY   UPDATED   UNAVAILABLE
tkg-113-wld-control-plane    true          true                   v1.19.3+vmware.1   1          1       1
kubectl patch kcp tkg-113-wld-control-plane --type merge --patch "$(cat patch.yaml)"

kubeadmcontrolplane.controlplane.cluster.x-k8s.io/tkg-113-wld-control-plane patched
kubectl get ma | grep $(kubectl get ma --sort-by=.metadata.creationTimestamp -o jsonpath="{.items[-1:].metadata.name}")

tkg-113-wld-control-plane-c7m6z     vsphere://422aeb95-8657-3557-a77f-6bc25bb0f075   Running   v1.19.3+vmware.1
kubectl delete haproxyloadbalancer tkg-113-wld-default

haproxyloadbalancer.infrastructure.cluster.x-k8s.io "tkg-113-wld-default" deleted
kubectl edit vspherecluster tkg-113-wld

Remove the following stanza:

  loadBalancerRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: HAProxyLoadBalancer
    name: tkg-113-wld-default
vspherecluster.infrastructure.cluster.x-k8s.io/tkg-113-wld edited

And another static DHCP reservation needs to be created for the workload cluster’s Kube-VIP address.

host wld-kube-vip {
       hardware ethernet 00:11:22:33:44:56;
       fixed-address 192.168.100.103;
}

My DHCP scope now looks like the following (excluding both the .100 and .103 addresses):

range 192.168.100.101 192.168.100.102;
range 192.168.100.104 192.168.100.250;

And both of my clusters are now upgraded to the 1.2.1 version running Kubernetes 1.19.3.

I ran into an interesting issue when I went to create a new 1.2.1 cluster.

tkg create cluster tkg-121-wld2 -p dev --vsphere-controlplane-endpoint=192.168.100.200

Logs of the command execution can also be found at: /tmp/tkg-20201212T063545219579220.log
Validating configuration…

Error: : workload cluster configuration validation failed: vSphere config validation failed: vSphere node size validation failed: the minimum requirement of VSPHERE_CONTROL_PLANE_NUM_CPUS is 2

Detailed log about the failure can be found at: /tmp/tkg-20201212T063545219579220.log

When I stood up the 1.1.3 environment, I had explicitly set the number of CPUs assigned to the control plane and worker nodes to "1". While that was acceptable for TKG 1.1, two is the minimum in TKG 1.2.1. These values can be set as command-line variables or in the .tkg/config.yaml file. Getting around this was as simple as removing the values (or updating them to "2" or higher). If you run into an error similar to this one, check your .tkg/config.yaml file or environment for values like the following and remove or update them.

VSPHERE_WORKER_NUM_CPUS: "1"
VSPHERE_CONTROL_PLANE_NUM_CPUS: "1"
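For reference, the corrected entries in .tkg/config.yaml would simply look like this, with "2" being the new minimum (use larger values if you size your nodes up):

VSPHERE_WORKER_NUM_CPUS: "2"
VSPHERE_CONTROL_PLANE_NUM_CPUS: "2"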
