Nutanix: Node stuck in Phoenix boot loop after LCM run

Life Cycle Management (LCM) within PRISM is an awesome framework that automates all kinds of hardware updates on your Nutanix nodes with just a few clicks.

In general, LCM has become pretty stable since AOS 5.5 and only rarely requires support assistance.

One situation I have now seen multiple times is that a node gets stuck in a Phoenix boot loop during a BIOS update. LCM within PRISM fails with some kind of error, and the node currently in maintenance does not come back online on its own. I would suggest getting support involved directly for assistance. Still, with the following steps you can at least get the node back online and fully operable again.

The first thing you usually do in such a situation is check the KVM console via the IPMI interface. LCM leverages the Foundation component to boot into the Phoenix ISO. If this process got stuck, your console probably looks like this:

You may have already tried to reboot / reset the node, but you will always end up in the Phoenix ISO again.

Just run the command python reboot_to_host.py to boot up your installed hypervisor:
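From the Phoenix shell on the KVM console, the step looks roughly like this (the exact prompt varies by Phoenix version; shown here only as an illustration):

```shell
# At the Phoenix shell reached via the IPMI/KVM console:
phoenix # python reboot_to_host.py
# The node reboots out of the Phoenix ISO into the installed hypervisor.
```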

The hypervisor now boots normally, and the CVM is started automatically afterwards. Prior to the reboot by LCM, the hypervisor (in case of AHV) and the CVM were placed into maintenance mode. In the scenario described above, where the LCM process failed, this maintenance flag is not removed automatically.

So if you want to get your node fully up and running again you need to perform the following tasks:

1.) End CVM Maintenance Mode

The standard CLI command you usually run for a first “health check” is cluster status. In this scenario you can see that the affected CVM (10.100.133.22) is shown with a Maintenance flag:
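If the cluster status output is long, you can quickly filter for the maintenance flag. The sample output below is a heavily shortened, hypothetical excerpt just to illustrate the idea; the real output contains far more detail:

```shell
# Hypothetical, shortened excerpt of `cluster status` output (illustration only):
cat <<'EOF' > /tmp/cluster_status.txt
CVM: 10.100.133.21 Up
CVM: 10.100.133.22 Maintenance
CVM: 10.100.133.23 Up
EOF

# List the CVM IPs still flagged as in maintenance mode:
grep 'Maintenance' /tmp/cluster_status.txt | awk '{print $2}'
```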

The next step is to switch to the NCLI, run the command host list, and find the UUID of the CVM that is in maintenance mode:

After you have figured out the UUID, run the command host edit id=<insert uuid here> enable-maintenance-mode='false' to disable the maintenance mode. You will see the result directly after running the command, as the ‘Under Maintenance Mode’ value now shows ‘false’:
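Put together, the NCLI part of the recovery looks like this (the UUID placeholder is kept as-is; the same commands can also be run from the CVM's bash prompt by prefixing them with ncli):

```shell
ncli> host list
# ...note the Uuid of the CVM that shows "Under Maintenance Mode : true"
ncli> host edit id=<insert uuid here> enable-maintenance-mode='false'
# The command output now shows "Under Maintenance Mode : false"
```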

If you are running VMware ESXi or Microsoft Hyper-V, you are done here. In case of AHV, continue with step 2.

2.) End AHV Host Maintenance Mode

In addition to the CVM the AHV host itself has its own maintenance mode.

If you go to the ACLI and run host.list, you will see that the host is not schedulable for running user VMs. And if you run host.get <hostname> for the details of the host, you will see a maintenance flag in the ‘node_state’ value.

The command host.exit_maintenance_mode <hostname> easily gets you back to the normal state:
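Sketched as an ACLI session (run acli from the CVM; <hostname> is a placeholder, and the exact field values differ between AOS versions):

```shell
acli> host.list
# The affected host is shown as not schedulable
acli> host.get <hostname>
# node_state shows a maintenance value instead of the normal state
acli> host.exit_maintenance_mode <hostname>
# host.list / host.get afterwards confirm the host is schedulable again
```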

PRISM will also show you the end of the maintenance mode, VMs get moved back to the host, etc.

Now all nodes in the cluster are up and running again, so you can take your time to figure out why LCM failed.

NOTE: Please use the commands above only if you know what you are doing, and at your own risk. If you are uncertain, I strongly recommend involving Nutanix support.
