Friday 26 July 2013

Failed to open (The parent virtual disk has been modified since the child was created)

Error:
  • Failed to open (The parent virtual disk has been modified since the child was created).
This error came up the other day on a couple of our virtual machines when we tried to power them on after they died over a weekend.
This issue is in fact covered extremely well by the following KB article here,
and I would highly recommend that you read through the article and get to grips with how the various files that make up the virtual server, its disks and its snapshots fit together, as it will help no end when trying to fix this or similar issues.


Now it turns out that this issue was being caused by our backup software trying to take a weekly tape copy of some virtual machines whilst at the same time a NetApp SnapManager for Virtual Infrastructure (SMVI) backup and replication job was trying to run.
The two snapshot operations seem to have overlapped: whilst one snapshot was being deleted, the other job was trying to create a new one, so the disk descriptor files ended up pointing to different snapshot delta files and referencing the wrong parent CID (this all makes more sense when you read the KB article, trust me!).
I'm not too sure why this is allowed to occur, but it has now happened around 5 times in our environment over weekends, to different VMs, so we have had to be more selective about when we schedule the tape backups to avoid the regular NetApp snapshots. (We only do both because we do not hold long disk retention policies offsite and so need tape backups to supplement our disk backup strategy for long-term retention... a pain, but just the way it is at present.)

To fix this issue the article recommends connecting to the host and manually opening, reading and possibly editing these files using vi, but that is not too easy when you are trying to compare multiple files and cross-reference CIDs and parent CIDs across potentially 3, 4, 5 or more disk descriptor files, depending on the number of snapshots and disks the VM has.

My approach is to follow the steps below and use free third-party tools to make things easier on yourself.

Process:

  1. Enable SSH on the ESXi host and open the host's firewall port for the SSH server if not already allowed (do this through vCenter for ease!)
  2. Connect to the ESXi host using WinSCP – This is much easier than going through the command line or vMA service as detailed in the KB
  3. Copy the following files to your local machine to identify the issue:
    1. Virtualserver.log – use this to identify which disk and which snapshot file is reporting the issue
    2. Virtualserver.vmx – use this to identify which snapshots are currently identified as in use
    3. Virtualserver.vmdk – this is the base disk descriptor file containing the first parent CID
    4. Virtualserver-000001.vmdk – this will be the first snapshot delta disk descriptor file and should have the base disk's CID as its parent (there may be more than one snapshot file per disk, such as 000002.vmdk and/or 000003.vmdk etc., each of which should reference the preceding snapshot as its parent until they eventually lead back to the base disk's CID)
  4. Use Notepad++ or similar to view all of the files (this utility is excellent for displaying these files in a more readable state and also preserves the files' formatting when you modify them, which you are likely to have to do!)
  5. Make a copy of the files unedited on your machine in case the resolution doesn't work (IMPORTANT!!!)
  6. Make the required changes to the disk descriptor files or the vmx file in order to resolve the issue, using the information in the KB article. For reference, if the snapshot delta file does not contain any data (16 MB or less, for example) then it may be best to just edit it out of the vmx file and point to an earlier snapshot or the base disk itself in order to bring the VM back online again.
  7. Copy the edited file(s) back to the original location and overwrite as needed using WinSCP
  8. Power on the VM and cross those fingers! :)
  9. If all is good then be sure to delete any unused snapshot descriptor, delta and checkpoint files from the virtual server's directory so as not to affect any future snapshots and to keep things clean.
This is a good and fairly straightforward resolution to the issue. The key to getting this right, though, is understanding how the descriptor files work and mapping out (on a piece of paper if need be) the relationship between each base disk and its snapshot(s) before making any changes. As mentioned, keep a copy of these files, as you may be able to revert any changes made in error just by replacing them. Ideally though, if you are not certain, always ensure that you have a full backup of all of the files (especially the flat files) before making any changes, as per best practices!
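
If mapping the chain out on paper gets tedious, the same cross-check can be scripted against the copies you downloaded. What follows is only a rough Python sketch of my own and not something taken from the KB article. It assumes the descriptor files are plain text containing CID=, parentCID= and parentFileNameHint= lines, and that a base disk has a parentCID of ffffffff. The function names and the CID values in the example are made up purely for illustration.

```python
import re
from pathlib import Path

# Descriptor lines we care about, e.g.  CID=fffffffe  or  parentFileNameHint="x.vmdk"
FIELD_RE = re.compile(
    r'^\s*(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"\r\n]*)"?\s*$'
)
NO_PARENT = "ffffffff"  # assumed parentCID of a base disk, which has no parent


def parse_descriptor(text):
    """Pull CID, parentCID and parentFileNameHint out of descriptor text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def load_descriptor(path):
    """Read a local copy of a descriptor file (never the flat/delta data files)."""
    return parse_descriptor(Path(path).read_text(errors="replace"))


def check_chain(chain):
    """chain: list of (name, fields), newest snapshot first, base disk last.

    Returns a list of problems found; an empty list means the CIDs line up.
    """
    problems = []
    if not chain:
        return problems
    # Each descriptor's parentCID should equal the CID of the next one down.
    for (child_name, child), (parent_name, parent) in zip(chain, chain[1:]):
        child_parent = child.get("parentCID", "").lower()
        parent_cid = parent.get("CID", "").lower()
        if child_parent != parent_cid:
            problems.append(
                f"{child_name}: parentCID {child_parent} does not match "
                f"CID {parent_cid} of {parent_name}"
            )
    base_name, base = chain[-1]
    if base.get("parentCID", "").lower() != NO_PARENT:
        problems.append(f"{base_name}: base disk parentCID is not {NO_PARENT}")
    return problems


if __name__ == "__main__":
    # Made-up descriptor contents, for illustration only.
    base = parse_descriptor("CID=aaaa1111\nparentCID=ffffffff\n")
    snap1 = parse_descriptor(
        'CID=bbbb2222\nparentCID=aaaa1111\nparentFileNameHint="Virtualserver.vmdk"\n'
    )
    snap2 = parse_descriptor(
        'CID=cccc3333\nparentCID=dddd4444\n'
        'parentFileNameHint="Virtualserver-000001.vmdk"\n'
    )
    chain = [
        ("Virtualserver-000002.vmdk", snap2),
        ("Virtualserver-000001.vmdk", snap1),
        ("Virtualserver.vmdk", base),
    ]
    for problem in check_chain(chain):
        print(problem)
```

To run it against real files, build the chain with load_descriptor() on each local copy, listing the newest snapshot first and the base disk last. It only reads the files and reports mismatches; any actual edits should still be made by hand, following the KB article, and only after taking the unedited copies mentioned in step 5.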

Good luck.