Tuesday, 23 October 2012

Hot Add CPU to a Windows 2008 R2 Enterprise VM

I'm still often caught out by just how cool modern IT really is.

Today for example we had an issue with one of our business systems which pretty much took it out of the water.  It seems that the business had deployed a large number of additional users to this system the previous day and by now they were all starting to access the server and place significant demand on its single vCPU.

For about an hour or more the server had been sitting there at 100% CPU (with the odd blip here and there as it churned its way through all of the work) and for new users the system was so unresponsive that it was effectively dead to them.  The few users who were already on the system and working before this CPU increase were still OK, although they were feeling the effects of the slow performance, and were still inputting data and working on their models etc.

This posed a problem.  On the one hand it was clear that this system needed to be running on more than 1 CPU now given the increased workload but on the other we didn't want to take the server out of the water (even for just a few minutes that it would take) to do this as it would mean cutting access to the users who were in and potentially causing them issues with unfinished modelling and part entered data.

Luckily enough this VM had been deployed as Windows Server 2008 R2 Enterprise.  This one decision, taken by some possibly overzealous engineer who originally built the server, saved the day as it allowed us to simply edit the VM's properties and increase the number of vCPUs allocated from 1 to 2 while the server was still running.

Now the really cool part was watching the server via the console with task manager opened at the performance tab.  The changes were completed within vCenter and then within around 5 seconds, the OS popped up a message to state that the data was incorrect and it needed to restart task manager. Clicking OK restarted it and up popped two CPU graphs and almost instantly the CPU levels started to drop to around 55-75% utilization.
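
As a rough sanity check on those numbers, here is a small Python sketch of the arithmetic. It assumes the utilization shown in task manager is an average across all vCPUs, and the function name is just something I made up for illustration.

```python
def demand_in_vcpus(utilization_pct, vcpus):
    """Convert average utilization across `vcpus` into whole-vCPU units of work."""
    return utilization_pct * vcpus / 100


# Figures from the post: 2 vCPUs running at roughly 55-75% after the hot add.
low = demand_in_vcpus(55, 2)
high = demand_in_vcpus(75, 2)
print(f"Demand after the hot add: {low} to {high} vCPUs' worth of work")
```

Anything above 1.0 here is more work than a single vCPU can deliver, which fits with the server sitting pinned at 100% before the change.
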
I had done this in the lab many times and knew it was technically possible if ever we needed to do it, but the need had never arisen in the few years or so that this has been possible. To actually use this feature in 'live' and against a running production system (of some significant importance to the business too) gave you a real sense of satisfaction.
It's great when IT just works.

Kudos to VMware and Microsoft for some pretty neat tech!

Thursday, 11 October 2012

p2v'd Windows 2000 vm with high cpu

We recently acquired some old Windows 2000 physical servers which needed to be brought online in our virtual infrastructure.  It's been a while since I p2v'd any servers as we had pretty much sucked up all of the old physical servers we had since deploying our first virtual environment and so it was time to dust down my old copy of VMware converter, a trusted utility which had been used almost flawlessly back in those early virtual days.
In fact I chose to download a newer version of this tool from the VMware website and then burn it to a CD to make a bootable disk from which to perform a cold clone of these old servers.

The cloning process was pretty standard as before with a few nice extras now afforded to me in the post migration steps making things even easier.  The newly cloned vm came up and then it was a simple process of going through all of the old hardware applications (HP management agents and the like) and removing them, along with uninstalling the old 'hidden' hardware from the server's previous physical state.
The server then came online and with a few tweaks here and there was ready to go into production again.

It was then that we noticed some issues around the performance of the vm which seemed familiar.  Within the vSphere client the CPU performance of the windows 2000 server was showing as practically 100% almost all of the time, yet within the OS the CPU was showing as idle.
I immediately thought that the HAL was not set correctly as this is a well documented issue, especially with Windows 2000 vms. However, when I went into device manager and looked under computer, the HAL was indeed set to what it should be: 'ACPI Multiprocessor PC'.  This was the same as on the other Win2k vms which had been migrated at the same time and they were not displaying this same CPU issue.

After looking into the issue a little more it seems that the idle thread (which normally lets the computer save power when not in use) gets stuck in a busy loop and so, although the OS believes it is not actually utilizing the CPU, the physical CPU is constantly receiving instructions and therefore the CPU demand for the vm is actually 100%.
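
To show the difference between a busy loop and a properly idle wait, here is a loose user-space analogy in Python. It has nothing to do with the HAL itself and the function names are my own, but it shows how two waits of the same length can cost very different amounts of CPU, which is essentially what the hypervisor sees from the outside.

```python
import time


def cpu_used_while(wait, seconds):
    """Return the CPU seconds this process consumed while wait(seconds) ran."""
    start = time.process_time()
    wait(seconds)
    return time.process_time() - start


def busy_wait(seconds):
    # Spin until the deadline: the CPU executes instructions the whole time.
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


def sleeping_wait(seconds):
    # Sleep until the deadline: the scheduler is free to run something else.
    time.sleep(seconds)


busy = cpu_used_while(busy_wait, 0.2)
idle = cpu_used_while(sleeping_wait, 0.2)
print(f"busy loop: {busy:.3f}s CPU, sleeping wait: {idle:.3f}s CPU")
```

Both waits take about the same wall-clock time, but only the busy loop burns CPU for the duration.
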

This issue was resolved in this case by changing the HAL from 'ACPI Multiprocessor PC' to 'Advanced Configuration and Power Interface (ACPI) PC'.
In Device Manager, under Computer, select 'ACPI Multiprocessor PC', right-click and choose Properties. Select Update Driver and then, in the window which appears, select 'Show all hardware of this device class' to list the available drivers.
Select the 'Advanced Configuration and Power Interface (ACPI) PC' model and click Next to install.

This process required a reboot of course to complete and I honestly can't stress enough how important it is to back up the vm before making these changes. As a belt and braces approach we cloned the vm first and applied the changes to the clone and then monitored it for a couple of days before applying the change to the live server.
Once the system rebooted the performance was what we expected again as per the other vms migrated at the same time.

I'm not sure how this driver differs from the previous one but swapping the driver made all the difference to this server's performance and as it's not going to be around too long anyway I'm happy to leave it this way until the application is moved to a new environment entirely.

Update: The above issue can also sometimes be resolved by re-installing the ACPI Multiprocessor PC driver.  Simply follow the steps above but instead of selecting the Advanced Configuration and Power Interface (ACPI) PC driver, just re-select the same driver, complete the install and then reboot the system.
This is less risky than changing the driver, but if it does not work then you should look to change to the 'Advanced Configuration and Power Interface (ACPI) PC' driver next.

Thursday, 4 October 2012

Setting Storage Path alerts

Since vSphere 4.0 there has been a large increase in the available alarms, both those which come pre-configured and those you can create yourself, and you can now create an alarm for almost anything within vCenter.

It still surprises me though that some quite essential monitoring areas are not included within the default set of pre-configured alarms.  One such alarm is the Storage Path Redundancy alarm, which will let you know when you have lost paths to your SAN storage and which datastores this will be affecting etc.  This is a very simple alarm to set up but also pretty essential to virtually all vSphere implementations these days I'd imagine.

To set up the alarm select the vCenter server in the vSphere client and then go to the 'Alarms' tab.
Select 'Definitions' to see a list of all currently configured alarms and then right click in the section to create a new alarm.
Give the alarm a name ('Degraded Storage Paths' for example) and change the Monitor to 'Hosts' and then choose 'Monitor for specific events occurring on this object, for example, VM powered On'.
On the 'Triggers' tab click 'Add' and then change the Event type to 'Degraded Storage Path Redundancy'.
Next select the 'Actions' tab and Add an action to be performed when this event occurs.  This can either be an email alert perhaps to the storage team or even a task for the ESXi host to perform.
Once set, click 'OK' and the alarm is set.

It's also worth creating another alarm to go along with this one which alerts when one of the ports goes offline too.  That way you get notifications of path redundancy lost or a full port connectivity loss which will help in troubleshooting the issue being experienced.

To set this up, simply create another alarm as above but this time set the trigger to be 'Lost Storage Path Redundancy' and set whatever actions you would like.
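
If you like to keep a record of what should be configured, the two alarms can be written down as a simple checklist. The Python sketch below is only a description of the settings from the steps above; it does not talk to vCenter. The class, the function, the 'Lost Storage Paths' alarm name and the example action are all my own inventions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmDefinition:
    name: str
    monitor: str
    trigger_event: str
    actions: list = field(default_factory=list)


# The two event triggers named in the post.
REQUIRED_TRIGGERS = (
    "Degraded Storage Path Redundancy",
    "Lost Storage Path Redundancy",
)

alarms = [
    AlarmDefinition("Degraded Storage Paths", "Hosts",
                    "Degraded Storage Path Redundancy", ["email storage team"]),
    AlarmDefinition("Lost Storage Paths", "Hosts",
                    "Lost Storage Path Redundancy", ["email storage team"]),
]


def missing_triggers(definitions, required=REQUIRED_TRIGGERS):
    """Return required triggers with no host alarm that has at least one action."""
    covered = {a.trigger_event for a in definitions
               if a.monitor == "Hosts" and a.actions}
    return [t for t in required if t not in covered]


print(missing_triggers(alarms))
```

An empty list means both storage path triggers are covered by an alarm that actually does something when it fires.
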

There are many other good alarms to set depending on what monitoring solutions you may or may not have in place for your virtual environment so it's always good to have a look through the list of available alarms and just check that you have everything you need configured before you need it... they're not going to do that much if you've created them after the event!

Tuesday, 2 October 2012

vSphere vmotion network outages

During heavy vmotion operations I was experiencing intermittent network outages of ESXi hosts and some vms running within the cluster.
This seemed to get progressively worse over a period of several weeks until, almost every time a host was placed into maintenance mode, there would be some network outage affecting some vms and even other ESXi hosts within the same cluster.

After initially looking at the network infrastructure we noticed that there was a large flood of unicast traffic on the vlan which was being shared by vmotion, and some windows based vms, around the time of the vmotions (to be expected in vmotion operations).

Now VMware best practice is to have vmotion and ESXi Management on their own separate vlans or networks but this had never been an issue previously with this cluster which was about 4 years old and had been upgraded over that time from ESX 3.5 to ESXi 5.0 u1 (its current state). There had also been no significant network changes during this period which could have had a waggling finger pointed at them so it was not obvious how we had come to this issue.
It seemed reasonable to start thinking that the gradual changes and growth of the cluster had started to cause this issue for us.  Over the various versions which these hosts have been running the vmotion feature has been greatly enhanced and improved and the number of simultaneous vmotions a host can support has also increased from 2 to 4 (or 8 with 10GbE) as can be seen here:

(taken from the vSphere 5.1 Documentation center here)
Network Limits for Migration with vMotion

Operation | ESX/ESXi Version | Network Type     | Maximum Cost
vMotion   | 3.x              | 1GigE and 10GigE | 2
vMotion   | 4.0              | 1GigE and 10GigE | 2
vMotion   | 4.1, 5.0         | 1GigE            | 4
vMotion   | 4.1, 5.0, 5.1    | 10GigE           | 8

There had also been a sizable growth in the number of hosts and virtual machines in this cluster and the hosts had been increased in capacity etc along the way too.  This all resulted in a much heavier demand for vmotion during the process of placing a host into maintenance mode as often I would be looking at somewhere between 20-50 virtual machines being migrated across the cluster.
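
To get a feel for how much busier the vmotion network had become, here is a rough Python sketch using the limits listed above. It assumes migrations run in full 'waves' at the per-host limit, which is a simplification of what vCenter really does, and the function name is hypothetical.

```python
import math

# (ESX/ESXi version, network type) -> simultaneous vmotions per host,
# copied from the limits listed above.
VMOTION_LIMITS = {
    ("3.x", "1GigE"): 2, ("3.x", "10GigE"): 2,
    ("4.0", "1GigE"): 2, ("4.0", "10GigE"): 2,
    ("4.1", "1GigE"): 4, ("5.0", "1GigE"): 4,
    ("4.1", "10GigE"): 8, ("5.0", "10GigE"): 8, ("5.1", "10GigE"): 8,
}


def migration_waves(vm_count, version, network):
    """Estimate how many rounds of migrations are needed to evacuate a host."""
    limit = VMOTION_LIMITS[(version, network)]
    return math.ceil(vm_count / limit)


for vms in (20, 50):
    old = migration_waves(vms, "3.x", "1GigE")
    new = migration_waves(vms, "5.0", "1GigE")
    print(f"{vms} vms: {old} waves on 3.x, {new} waves on 5.0 (1GigE)")
```

Fewer waves means the same evacuation finishes sooner, but each wave pushes twice as many simultaneous migration streams onto the network, which is exactly where a shared vlan starts to hurt.
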

It turned out that this was in fact our issue and so as we had spare capacity within our hosts, due to the recent removal of some iSCSI connections to this cluster, we were able to hook up a couple of dedicated vmotion nics per host and placed them into their own vlan away from the management and any other systems.
vSphere 5.0 gives us the ability to utilize more than 1 vmotion nic per host. All that was needed was to create 2 new vmkernel ports on a new vSwitch and have vmkernel port 1 bound to vmnicX as active and vmnicY as standby, then just reverse the configuration for the second vmkernel port.
Once the two new vmotion ports were created and assigned IPs on the new vlan, I just removed the old vmotion port which was in the shared vlan and then that was all of the configuration which was needed.
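
The mirrored active/standby layout described above can be written out as a small Python sketch. This only models the intended configuration as data; it does not configure a host. The port names 'vMotion-01' and 'vMotion-02' and the function name are my own placeholders, and vmnicX/vmnicY stand in for whichever physical nics you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmotionPort:
    name: str
    active_nic: str
    standby_nic: str


def mirrored_vmotion_ports(nic_a, nic_b):
    """Return two vmkernel port definitions with mirrored active/standby uplinks."""
    return [
        VmotionPort("vMotion-01", active_nic=nic_a, standby_nic=nic_b),
        VmotionPort("vMotion-02", active_nic=nic_b, standby_nic=nic_a),
    ]


for port in mirrored_vmotion_ports("vmnicX", "vmnicY"):
    print(f"{port.name}: active={port.active_nic}, standby={port.standby_nic}")
```

Each nic is active for one port and standby for the other, so both links carry vmotion traffic in normal running and either one can take over if the other fails.
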

I performed a few test migrations after that and the performance improvement was easily visible even without measuring it.  We used to have windows vms with 4GB ram move between hosts within 1-2 mins and now we are getting them within 30 seconds.

Best of all, when now entering maintenance mode on a host, even one which is running many vms, we are no longer getting any network outages and the process is a lot speedier too. Happy times again!