Monday 24 September 2012

Free VMware SRM training videos

VMware have just released a set of free (yup, totally free!) training videos for Site Recovery Manager (SRM) on their website:

http://blogs.vmware.com/education/2012/09/free-site-recovery-manager-training.html

This is a great resource for those wishing to deploy SRM and I would urge everyone to take a look through the videos before starting a deployment.


Wednesday 19 September 2012

Migrating VMs running on VSS to VDS in a Production ESX cluster (Migrate Virtual Machine Networking…)


With one of our ESXi 5.0 clusters growing to 12 hosts, and our Networking team constantly wanting to deploy new VLANs like they are going out of fashion, it was time to implement a distributed vSwitch (vDS) on the cluster to reduce the administrative overhead of adding every port group to each vSwitch on each host (not really the case, but it’s always good to keep up the professional ribbing of those poor network guys, eh?).

The process to deploy a new vDS to a cluster is pretty straightforward and you can follow the process from within vCenter here: VMware KB

Once the VDS was created, the next step was to create each of the VLAN port groups on it. Here I simply set up each port group with the same name as currently used on each VSS (you can do this because the name has ‘(dvswitch)’ appended to it anyway, which keeps these separate from the existing port groups when selecting network connectivity while editing VMs) and the same VLAN ID.

Next I moved 2 of the 4 x 1Gb adaptors from each VSS into the dvuplink ports on the VDS. This allowed the existing VSS port groups to continue servicing network requests for all of the running VMs, and also allowed me to start moving VMs from the VSS to the VDS.

To migrate the VMs from their current port groups on each VSS to the newly created port groups on the VDS you can use the excellent Migrate Virtual Machine Networking utility which manages the bulk modifications to VMs.

To do this simply go to the networking screen in the VI client, right click on the new VDS and choose ‘Migrate Virtual Machine Networking…’

Next select the source network from the drop down list (this is the current port group that you want to move VMs off of) and then select the destination network (the corresponding port group on the dvSwitch).

Click Next and you can now select all or some of the VMs to be migrated.  If you select all of the VMs you’ll be able to sit back and watch as each VM is modified in turn and moved over.  It really is simple and best of all results in no network outage to the running VMs.
I migrated several hundred VMs across our various vlans to the distributed switch without one little blip! 
Then it was just a matter of going through each of the hosts and cleaning up the old port groups and vSwitches which were no longer being used.

Sunday 16 September 2012

vSphere VM deployment customizations

A small but annoying thing had started to happen to our deployments of Windows 2008 R2 vms in our production environment recently. Whenever we deployed a new vm using our pre-saved customization specification, the vm would be deployed as expected except that it was not joined to our production domain.
The image would be customised, the server name changed, IP settings applied, administrator password set etc, but the vm would no longer be joined to our Windows domain.

Alarmingly, although the domain join option was set within the specification, there were no errors recorded for this in the customization log on the newly deployed vm (found at c:\windows\temp\vmware-imc\guestcust.log), where I would have expected to see them.

The answer, it turned out, was very simple.  The customization specification had been modified to have domain\username in the username field of the domain customization properties.  Although this looks perfectly reasonable in a Windows environment, the field actually needs to contain just the username of the domain account which will be joining the vm to the domain.
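
As a rough illustration of the fix, the Python sketch below strips a domain prefix from an account name before it would be typed into the username field. The helper name `bare_username` and the sample account are hypothetical and not part of any VMware tooling; the sketch only shows the string handling and assumes the `DOMAIN\user` form described above.

```python
def bare_username(account):
    """Return only the account name from input such as DOMAIN\\user.

    Hypothetical helper for illustration; it only handles the
    backslash-separated form mentioned in the post.
    """
    account = account.strip()
    if "\\" in account:
        # Keep the part after the last backslash: PROD\svc_join -> svc_join
        account = account.rsplit("\\", 1)[1]
    return account


# Example with a made-up account name.
print(bare_username("PROD\\svc_join"))
```

A name that is already bare is returned unchanged, so the helper could be applied to whatever value is currently saved in the specification.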

After changing the pre-saved customization to just the account name and re-entering the password I fired off a test deployment and voilà, one Windows vm deployed and sitting on our production domain as before!

Wednesday 12 September 2012

Virtual Machine disk consolidation fails with I/O error on change tracking file

A vm was displaying the warning 'Virtual machine disks consolidation is needed'. This is a nice feature of vSphere 5, which now actively tells you about the issue (it has always existed in previous releases but was never highlighted in this way until 5.0).

We often get this issue as we use a snapshot backup technology to back up our vms each day. For some reason or other the remove snapshot process sometimes does not complete properly, and we end up in a situation where the snapshots are removed but the snapshot files are still present and referenced in the vm. See the following VMware KB article for details: Consolidating snapshots in vSphere 5

Usually this is a simple process of right clicking the vm and selecting 'snapshot > consolidate' to have the snapshot child disk files consolidated back to the parent disk file, but in this case the consolidation failed with the error message: 'A general system error occurred: I/O error accessing change tracking file'.

After some investigation I found that our backup system had a lock on one of the files and so I was able to release the file from the backup software and then re-run the consolidation which completed and all was good again!
The troubleshooting steps to identify the locked file can be found here: Investigating virtual machine file locks on ESX/ESXi

Previously I've also been able to resolve the issue of not being able to consolidate vm disks by creating a clone of the troubled vm and bringing it up as the active vm and then deleting the old one. Not always possible though in a production environment!

Tuesday 11 September 2012

vCenter Operations Manager not displaying Risk or Efficiency data

So we finally made the upgrade from CapacityIQ and deployed VMware's new vCenter Ops Manager in its place.  The upgrade process of deploying the new appliance was straightforward and error free.

During the installation process you have the option to import your old data and settings from the CapacityIQ appliance into the new vCOM database so you don't lose any of the existing trending information etc.

This process worked like a charm but after a few days or so I noticed that the Risk and Efficiency data never populated on the dashboard screen and I was not able to get any Capacity or Trending information.

After looking at a few blogs and the excellent VMware Communities I was still not able to find why this was not working and so logged a support call.  The answer was simple and when thinking about it, obvious.
The below was the summary provided to me from support: 

To calculate time remaining and capacity remaining metrics, there are overall 5 resources that we consider
 
* cpu
* memory
* disk IO
* disk space
* network IO
 
However, these 5 resources do not apply to all object types. So under the hood, we actually consider a subset of the applicable resources for each object type. For example, for a datastore object we consider only selected resources out of disk space and disk IO resources; for a vm object we consider only selected resources out of cpu, memory, disk space resources; for host and up, we consider selected resources out of all resources. For a given object type, if all the applicable resources are unchecked (i.e., none are selected), the metric calculation module is unable to figure out the metric dependency and unable to calculate the time remaining or vm remaining values.
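
The rule in that summary can be sketched in a few lines of Python. This is only a model of the behaviour as support described it, not vCenter Operations Manager code: the dictionary and function names are made up for illustration, and the resource sets are taken directly from the text above ("host and up" is represented by a single `host` entry).

```python
# Applicable resources per object type, as listed in the support summary.
APPLICABLE = {
    "datastore": {"disk space", "disk IO"},
    "vm": {"cpu", "memory", "disk space"},
    "host": {"cpu", "memory", "disk IO", "disk space", "network IO"},
}


def can_calculate(object_type, selected):
    """True if at least one applicable resource is selected for the type."""
    return bool(APPLICABLE[object_type] & set(selected))


def blocked_object_types(selected):
    """Object types for which no applicable resource is selected."""
    return sorted(t for t in APPLICABLE if not can_calculate(t, selected))


# With only cpu and memory ticked, datastores have nothing to calculate from.
print(blocked_object_types({"cpu", "memory"}))
```

In this model, selecting either disk resource is enough to clear the datastore entry from the blocked list.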


Now in CapacityIQ we never cared about disk capacity as a factor in our host capacity reports, as we run several SANs which are attached to our ESXi cluster and this space is only carved up and added to the environment on a per-need basis. We were mainly concerned with CPU and memory, so the disk settings were not selected in CapacityIQ and so were not carried across to the new vCOM deployment when we imported the settings and data from CapacityIQ.

In our case, simply adding 'Disk Space capacity and usage' and/or 'Disk I/O capacity and usage' in the "Capacity & Time Remaining" configuration panel solved the problem!
When the Analytics process next ran on the system (1am by default) the Risk and Efficiency areas populated and all was well.

The support guy did mention that this is being fixed in version 5.6 so that at least one of the applicable resources for each object type is checked, but for now it's a manual process.


Thursday 6 September 2012

VMware SRM 5 recovery plan environment scripts

How to create a recovery plan script in SRM5 that will perform different tasks depending on whether the recovery plan is in test mode or recovery mode.



It's pretty easy to add scripts to recovery plans in SRM5 to perform all sorts of tasks in recovered environments or VMs. But what if you need the script to do something different when it is run in a test scenario, like adding some test environment specific routes, or adding some hosts file entries to allow recovered VMs to talk to one another in a non-production LAN (where no DNS or gateways exist, for example)?  Well, thanks to SRM5 you can make use of some environment variables which are injected into the recovered VMs by the SRM service in order to do just that!

The main variable to look at here would be VMware_RecoveryMode. This variable has a setting of either test or recovery depending on how the recovery plan is being run at the time, and so can be referenced in your script to act differently according to the value of this variable.

A basic example of this can be found in the below script which is a simple batch file...

REM Branch on the recovery mode. The quotes keep the IF valid even if the variable is empty.
IF "%VMware_RecoveryMode%"=="test" (Goto TestRun) Else (Goto OtherRun)

:TestRun
REM Find this VM's IPv4 address, then strip the spaces left over from the ipconfig output.
for /f "delims=: tokens=2" %%a in ('ipconfig ^| findstr /R /C:"IPv4 Address"') do (set tempip=%%a)
set tempip=%tempip: =%

REM Add persistent routes to the test networks via this VM's own address.
route add 10.10.1.0 mask 255.255.255.0 %tempip% -p
route add 10.10.2.0 mask 255.255.255.0 %tempip% -p

Echo Routes Applied to Test environment on %date% at %time% >> c:\srm\srmlog.txt

REM Add a hosts file entry so the name resolves without DNS.
Echo 10.99.53.13 server1.company.com >> %windir%\system32\drivers\etc\hosts
Echo Host file entries Applied on %date% at %time% >> c:\srm\srmlog.txt
EXIT

:OtherRun
IF "%VMware_RecoveryMode%"=="recovery" (Goto RecoveryRun) Else (Echo an unexpected result occurred on %date% at %time% >> c:\srm\srmlog.txt)
EXIT

:RecoveryRun
Echo Recovery started on %date% at %time% >> c:\srm\srmlog.txt
EXIT

This script checks to see if the recovery mode is 'test' and if it is then proceeds to run some things under the :TestRun section.
If the mode is not test then it checks to see if it is in recovery mode and again if it is it then runs some things under the :RecoveryRun section.
If for some reason it doesn't see either test or recovery in the variable, it just writes a line to the log file at c:\srm\srmlog.txt on the VM running the script.
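
For anyone who prefers to script this in something other than batch, the same three-way decision can be sketched in Python. This is not the script used above, just a minimal rendering of its branching logic; the function name `choose_branch` is made up for illustration, and it assumes a Python interpreter is available inside the recovered VM.

```python
import os


def choose_branch(env=None):
    """Pick a branch from the VMware_RecoveryMode environment variable.

    Returns 'TestRun', 'RecoveryRun' or 'Unexpected', mirroring the
    labels used in the batch file.
    """
    env = os.environ if env is None else env
    mode = env.get("VMware_RecoveryMode", "")
    if mode == "test":
        return "TestRun"
    if mode == "recovery":
        return "RecoveryRun"
    # Variable missing or holding some other value.
    return "Unexpected"


if __name__ == "__main__":
    print(choose_branch())
```

Passing a plain dictionary instead of the real environment makes it easy to try each branch on a workstation before attaching anything to a recovery plan.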

There are other environment variables available to play with too, which can be found in the SRM administrators guide, so hop over to VMware and check it out.