How to restore a lost master and etcd member on OCP 4.6
This test is a hands-on review of the "Replacing an unhealthy etcd member" procedure in the documentation. It walks through recovering a single lost master on the bare-metal platform. As of OCP 4.5, etcd is managed by the etcd operator, so this restore procedure differs from the one used in older versions.
After removing one master node VM on a hypervisor host, I attempt to restore the lost master. A majority of the masters remains available and the cluster still has etcd quorum, which is the expected state for this test.
Test Environments
- OCP version: OpenShift Container Platform 4.6.1
- Platform and Installation method: Bare-metal hosts and UPI
- Cluster size: Master x3, Worker x3
Back up etcd before the test
You should take an etcd backup or a VM snapshot as insurance before the test.
Take the etcd backup on any one master node. It is required only once, on a single master, not on all master nodes.
Refer to Backing up etcd in the documentation for more details.
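As a minimal sketch of that procedure (the script path and backup directory are the ones shown in the documentation; <master_node> is a placeholder for one of your masters):
$ oc debug node/<master_node>
sh-4.4# chroot /host
sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
The script should leave a snapshot_<timestamp>.db file and a static_kuberesources_<timestamp>.tar.gz file in the backup directory.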
Copy the backup files to another host using scp or your favorite tool.
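For example, from a host that has SSH access to the master (the user, host names, and destination path are placeholders):
$ scp -r core@<master_node>:/home/core/assets/backup <user>@<backup_host>:/path/to/safe/location/
Depending on the file ownership on the node, you may first need to make the backup files readable by the core user.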
Remove one master VM for the test
In this test, I removed the master1 node VM through a hypervisor operation to reproduce the loss of a single master.
- Before removing master1 from the OCP cluster:
- After removing master1 from the OCP cluster:
Create a new master1 to replace the old one
Check the current cluster status after removing master1, before starting the restore tasks.
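A minimal set of checks, assuming etcd-master2 is one of the remaining healthy etcd pods (verify the actual pod names in your cluster):
$ oc get nodes
$ oc -n openshift-etcd get pods -l k8s-app=etcd
$ oc rsh -n openshift-etcd etcd-master2
sh-4.4# etcdctl member list -w table
At this point the node list should show master1 as NotReady, and the member list should still contain the old master1 entry.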
In this case, I reuse the removed master1's hostname and IP address for the new master1 RHCOS installation, in order to show that the remaining etcd members will not accept the new node until the old member's metadata is removed, even when the same IP and hostname are configured.
You should approve the CSRs for the new master1, just as you would when adding a new node to an existing OCP cluster.
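For example, list the pending CSRs and approve them by name (there are typically both client and server CSRs for a new node, so you may need to repeat this):
$ oc get csr
$ oc adm certificate approve <csr_name>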
Restore the unhealthy etcd member
After the master1 node is restored, you can see that the new etcd-master1 pod does not start or work properly, even though an etcd-quorum-guard pod was created on the new master1 and the new master1 was configured with the same IP and hostname.
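To observe the symptom, list the etcd pods and check endpoint health from one of the healthy members (etcd-master2 is an assumed pod name):
$ oc -n openshift-etcd get pods -l k8s-app=etcd -o wide
$ oc rsh -n openshift-etcd etcd-master2
sh-4.4# etcdctl endpoint health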
Remove the remaining old etcd member so that the new one can be added to the existing etcd cluster.
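A sketch of the removal, following the Replacing an unhealthy etcd member procedure; replace etcd-master2 and the member ID with the values from your own cluster:
$ oc rsh -n openshift-etcd etcd-master2
sh-4.4# etcdctl member list -w table
sh-4.4# etcdctl member remove <old_master1_member_id>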
Additionally, you should remove all secrets that belong to the old etcd member.
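In OCP 4.6 the per-member secrets live in the openshift-etcd namespace and are named after the node, so for master1 the deletion should look roughly like this (verify the names with the first command before deleting):
$ oc get secrets -n openshift-etcd | grep master1
$ oc delete secret -n openshift-etcd etcd-peer-master1 etcd-serving-master1 etcd-serving-metrics-master1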
After removing the secrets, you can see that the new master1 etcd pod starts running automatically, as follows.
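For example:
$ oc -n openshift-etcd get pods -l k8s-app=etcd
Within a few minutes the new etcd-master1 pod should appear and reach the Running state.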
If the output from the previous command only lists two pods, you can manually force an etcd redeployment. In a terminal that has access to the cluster as a cluster-admin user, run the following command:
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Restore tasks are completed
Verify that the new etcd member on the new master1 was added automatically, and check that all etcd members are healthy, as follows.
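A verification sketch, again assuming etcd-master2 is a pod you can rsh into:
$ oc rsh -n openshift-etcd etcd-master2
sh-4.4# etcdctl member list -w table
sh-4.4# etcdctl endpoint health
All three members should be listed as started, and every endpoint should report as healthy.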
Thank you for reading.