Nutanix AHV: How to get rid of orphaned 3rd party backup snapshots

As with any other hypervisor, you can end up with orphaned snapshots on AHV when using third-party backup tools.

In general, it is the responsibility of the backup tool to clean up after a backup job has run successfully. There can be various reasons why this does not happen and snapshots are left behind.

Unlike on other hypervisors, these snapshots are not visible in PRISM, which makes it hard to delete them manually. However, PRISM has a built-in alert that warns you if such snapshots exist.

Besides these warnings, you may also run into an error when trying to delete a protection domain: Specified protection domain PD-name has xx third-party backup snapshot(s)

To remove these snapshots manually, log in via SSH to one of the CVMs or to the cluster IP address.

The following command lists all snapshots of the specified protection domain that are flagged with the type “scoped”. Besides the scoped type, which marks the hidden third-party snapshots, there is a “regular” type for snapshots that are created manually via PRISM or by a schedule.

cerebro_cli query_protection_domain <protection_domain_name> list_snapshot_handles="true;scoped"

 

Here you see an example output block of a single snapshot:


snapshot_control_block_vec {
  handle {
    cluster_id: 32429
    cluster_incarnation_id: 1485438944851520
    entity_id: 2642390
  }
  start_time_usecs: 1517493033443226
  finish_time_usecs: 1517493034443233
  expiry_time_usecs: 3664976681443233
  schedule_ids: 2642386
  total_user_bytes: 0
  entity_vdisk_id_vec: 62463790
  snapshot_uuid: "af00e5a5-6c0a-4a0e-9055-39306edee304"
  is_out_of_band_snapshot: true
  app_consistent_snapshot_hint: false
  type: kScoped
  task_uuids: "\206\201\266\234\025\214H\305\207\211\346\263\245\370S\217"
  cluster_architecture: kX86_64
}
 

Besides the kScoped type, the expiration time of the snapshot is an indicator of a third-party snapshot. If you convert the expiry_time_usecs value into a human-readable format, you will probably want to go on with the next steps instead of waiting for this date (such a far-future expiry seems to be the default for all third-party snapshots):

expiry_time_usecs: 3664976681443233 = Tuesday, February 19, 2086 5:04:41.443 PM (UTC)
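
The conversion can be reproduced on the shell. This is a minimal sketch, assuming GNU date is available; usecs_to_utc is a made-up helper name for illustration, not part of any Nutanix tooling.

```shell
# Convert a microsecond epoch timestamp (as printed by cerebro_cli) to UTC.
# Assumes GNU date; usecs_to_utc is a hypothetical helper name.
usecs_to_utc() {
  # Integer division drops the sub-second part.
  date -u -d "@$(( $1 / 1000000 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

usecs_to_utc 3664976681443233   # prints: 2086-02-19 17:04:41 UTC
```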

The trick to get rid of this snapshot is to overwrite the expiry_time_usecs value with the current timestamp using the following command, which gets the snapshot expired about one second later.

cerebro_cli modify_snapshot snapshot_uuid=<cerebro uuid> expiry_time_usecs=`date +%s`
 

Replace the <cerebro uuid> placeholder with the snapshot_uuid value from the example output above (without the quotes).
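
One detail worth knowing: date +%s prints seconds, while the field name suggests microseconds. Read as microseconds, such a value lands in January 1970, which is in the past either way. The sketch below only illustrates that arithmetic with a fixed sample value (the start time from the example output, cut down to seconds) and assumes GNU date; it makes no claim about how Cerebro handles the value internally.

```shell
# A seconds value like the one `date +%s` prints, fixed here for reproducibility.
now_secs=1517493033

# Interpreted as microseconds, this is only about 1517 seconds after the epoch.
as_secs=$(( now_secs / 1000000 ))
date -u -d "@${as_secs}" '+%Y-%m-%d %H:%M:%S UTC'   # prints: 1970-01-01 00:25:17 UTC
```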

This works great if you have one or two snapshots, but you may have a protection domain with 100 or more snapshots, like I had, and doing this manually per snapshot would drive you crazy.

To automate this, you can use the following loop, which expires all scoped snapshots within a protection domain:

for i in $(cerebro_cli query_protection_domain <protection_domain_name> list_snapshot_handles="true;scoped" | grep snapshot_uuid | awk '{print $2}' | cut -c 2-37); do cerebro_cli modify_snapshot snapshot_uuid=$i expiry_time_usecs=`date +%s`; done
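
Before running the loop for real, it can be reassuring to see which UUIDs it would touch. The following is a dry-run sketch under stated assumptions: extract_snapshot_uuids is a hypothetical helper that reads query_protection_domain output from stdin, and the sample text is copied from the example output in this post rather than taken from a live cluster.

```shell
# Hypothetical helper: print the snapshot UUIDs found in cerebro_cli
# query_protection_domain output read from stdin, with the quotes removed.
extract_snapshot_uuids() {
  awk '$1 == "snapshot_uuid:" { gsub(/"/, "", $2); print $2 }'
}

# Sample input copied from the example output, not from a live cluster.
sample='snapshot_uuid: "af00e5a5-6c0a-4a0e-9055-39306edee304"
is_out_of_band_snapshot: true
type: kScoped'

# Dry run: print the commands instead of executing them.
printf '%s\n' "$sample" | extract_snapshot_uuids | while read -r uuid; do
  echo "cerebro_cli modify_snapshot snapshot_uuid=${uuid} expiry_time_usecs=$(date +%s)"
done
```

On a CVM, the sample would be replaced by piping the real query output into the helper. Removing the echo turns the dry run into the actual change.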
 

Now you should be able to delete the protection domain via PRISM.

If you have many protection domains, you may want to use the Cerebro master page to get an overview of all snapshots.

Just run links http://localhost:2020 on the Cerebro master CVM (or on any other CVM, which will provide you with a link to the master).

There you find a list of all protection domains including the number of snapshots.

If you go into one of the protection domains, you get a list of all snapshots including the snapshot type (e.g. scoped).

You may notice that the list of protection domains also contains a system protection domain with a long ID as its name. Some backup tools may use this system protection domain for their snapshots instead of the individual user/regular protection domains. The process to delete snapshots within this system protection domain is much the same.

Run the following command on the CLI to list the system protection domains:

cerebro_cli list_protection_domains list_system_protection_domains=true
 

This will give you output like this at the end:

protection_domain_name: "050319d9-cc11-4704-9f57-2c8a6a0110b0"
 

This name can now be used in the steps outlined above.
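
If the list is long, the names can be pulled out of the output with a small filter. This is only a sketch: extract_pd_names is a hypothetical helper, the sample line is the one shown above, and it assumes that protection domain names contain no spaces.

```shell
# Hypothetical helper: print protection domain names from
# cerebro_cli list_protection_domains output read from stdin.
extract_pd_names() {
  awk '$1 == "protection_domain_name:" { gsub(/"/, "", $2); print $2 }'
}

# Sample line copied from the output shown above.
sample_pd='protection_domain_name: "050319d9-cc11-4704-9f57-2c8a6a0110b0"'
printf '%s\n' "$sample_pd" | extract_pd_names   # prints: 050319d9-cc11-4704-9f57-2c8a6a0110b0
```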

 

NOTE: Please use the commands above only if you know what you are doing, and at your own risk. If you are uncertain, I would strongly recommend involving Nutanix Support.

Credits to my friend @GuidoHagemann from Nutanix for his help with this!
