Wednesday 11 February 2015

Azure backup service - new retention policy options

Since the Azure Backup service was launched way back in October 2013 we have wanted to move some of our backup jobs over to it. In part this was to alleviate the strain on our existing on-premises solution, but cloud storage as a backup destination also makes a lot of sense, especially when you think about long term data retention.  Managing backups on tape with 3rd party services is OK, but can you really be sure (without considerable expense) that your data will be recoverable if/when you ever need to go back to your tape in a year or more's time?

Since the launch of the service, though, I have not yet been able to get it running in any production environment.  This has been due to a number of issues, but the main one was that the data retention policy was just too short.  That, though, has now changed!

Previously you only had 120 recovery points available to be used.  Not a problem, you would think, until you discover that you can only have one policy per server...daily or weekly.  If you set a daily backup, say Monday to Friday, you could keep your backups for up to 169 days (a little over 5 months) before you hit your retention limit.

If you changed to a weekly backup to get a longer retention period you would have backups going back just over 2 years...better, but then these would only be weekly backups, with no daily jobs running.  Not a good place to be!
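For reference, the arithmetic behind those figures is straightforward:

  120 recovery points ÷ 5 backups per week (Mon-Fri) = 24 weeks, or roughly 168-169 days of retention
  120 recovery points × 1 backup per week = 120 weeks, or roughly 2.3 years of retention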

However, as of 10th February 2015 Microsoft have now addressed this...and in quite a big way.
They now support up to 366 recovery points and also multiple policies per server.  Using these new options gives you the ability to have backups as you would expect within most enterprises, with retention periods of many years whilst still retaining at least daily backups of your datasets!
The below screenshot shows the out-of-the-box default policy configured when you install the latest Azure Backup client onto your Windows server and register it with your vault in Azure.
It's quite a change, as you can see, and features not only daily, weekly and monthly policies but yearly too, with a default retention period of 10 years for your yearly backups!
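If you prefer scripting to the GUI, the same sort of policy can be driven from PowerShell on the protected server using the Azure Backup agent's OB* cmdlets. The below is only a rough sketch of a daily policy - the path, time and retention values are made-up examples, and the newer weekly/monthly/yearly retention options may need additional parameters (or the GUI) depending on your agent version:

  Import-Module MSOnlineBackup
  $policy = New-OBPolicy
  $files = New-OBFileSpec -FileSpec "D:\Data"
  Add-OBFileSpec -Policy $policy -FileSpec $files
  $schedule = New-OBSchedule -DaysOfWeek Monday,Tuesday,Wednesday,Thursday,Friday -TimesOfDay "21:00"
  Set-OBSchedule -Policy $policy -Schedule $schedule
  $retention = New-OBRetentionPolicy -RetentionDays 366
  Set-OBRetentionPolicy -Policy $policy -RetentionPolicy $retention
  Set-OBPolicy -Policy $policy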

I'm not saying that this is now perfect.  There are still many areas in need of improvement (alerting and centralised views of all of your backup jobs at the very least), but this addresses the number one reason why we would never have used this service for production jobs previously, and it is certainly going to be re-evaluated for some of our services.

It would be great to hear from anyone else who has experience of this service, along with any other pros/cons you have encountered.




Thursday 20 March 2014

Restoring long file or file paths from volume shadow copy

We have had a couple of instances recently where we have tried to restore files for users using the Previous Versions feature on a Windows client/server, as well as from a CIFS file share, and because of the very long folder paths and file names in use, the restore/copy process fails to restore all of the data back to the live environment (and in some cases it doesn't actually tell you this!).

On a Windows machine you need to expose the volume shadow copy via the command line in order to create a persistent link which you can map to, reducing the path length (a more in-depth article on that process can be found here), but on a NetApp CIFS share what's the process?

Well it's actually much easier than the Windows method :-)

You first need to expose the snapshot directory of the CIFS volume.  This is done by running the command options cifs.show_snapshot on on the filer, then connecting to the root volume share containing the data and mapping a network drive. Within this share you will now see the ~snapshot folder containing all of the previous snapshots kept on disk.
Simply browse the snapshot folder for the date/time you require (you can get the correct time info from the snapshot view within the OnCommand tool for the volume) and, once you are far enough along the file path, map a drive to that folder and another drive to the same directory on the live share and copy/restore away!

If the CIFS share you are trying to restore from is on a vfiler and not the root filer then you will need to enable the snapshot view on the vfiler itself. In that case, log into the root filer, type vfiler context vfilername (substituting your own vfiler name) and then run options cifs.show_snapshot on within that vfiler context.
One last point - once you have finished the restore, don't forget to turn off the snapshot view on the filer/vfiler (unless you want to leave it enabled).
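Pulling all of that together, the whole restore looks something like the below from the filer console and a Windows client. The vfiler name (vfiler1), share name (data), snapshot name (nightly.0) and folder paths are just examples - substitute your own:

  filer> vfiler context vfiler1
  vfiler1@filer> options cifs.show_snapshot on

  (from the Windows client)
  C:\> net use S: \\vfiler1\data\~snapshot\nightly.0\some\long\folder\path
  C:\> net use T: \\vfiler1\data\some\long\folder\path
  C:\> robocopy S:\ T:\ /E

  (and once you are done)
  vfiler1@filer> options cifs.show_snapshot off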


Hope this helps anyone else who has to do this in the future…

Tuesday 4 February 2014

A NetApp Flash Accel 1.3 deployment

***Disclaimer*** This is a bit of a long post.  I wanted to get it all down in one go, so apologies if you are still reading this in another 30 minutes' time :-)

So I have just finished a deployment of the newly released NetApp Flash Accel 1.3 for VMware (as of 30th Jan 2014) and I have to say I am impressed.
Not only was the entire install process straightforward and clean, but the performance benefits of the deployment were being seen within the first hour of VMs being migrated - always a benefit when you have management looking over your shoulder as you deploy a new solution!

For this deployment I was installing 6 new HP BL460c G8 blades, each with 2 x Intel DC S3700 400GB SSD drives in a RAID 0 configuration (I am using the MicroSD card for the ESXi OS as these blades only have 2 x 2.5" disk slots).

A couple of pre-requisites exist for this solution: you must be running vSphere 5.1 (although you do not need an Enterprise Plus licence, as any 64-bit 5.1 version is supported) and you need to allow for additional memory requirements on the host depending on how much SSD you have configured for Flash Accel in each host.  The second point is almost moot though for all but the most heavily used vSphere deployments, as the requirement for version 1.3 has dropped from 0.006GB to 0.0035GB of RAM per 1GB of SSD. As described by the manual, you therefore only need 3.5GB of additional host RAM for 1TB of SSD...not a lot really considering most modern ESXi hosts are likely to sport 128GB, 192GB or even 256GB+ of RAM these days.

Another consideration for deploying Flash Accel is that it currently only supports Windows 2008 R2 and higher VMs.  This is because an OS agent needs to be installed on each VM being enabled for Flash Accel and, so far, NetApp have only written a Windows driver, although they do state on their web page that a Linux driver will follow at some point.
Now this is going to be a pain point for some people as Linux VMs are popular (and in many cases, getting more popular) and even our deployment would have benefited further from being able to accelerate Linux VMs (of which we have many) as well.

NB: If you are running Enterprise Plus licences and have upgraded to vSphere 5.5 then you always have vFlash as an option.  This solution from VMware is kernel based, so it does not require any guest drivers and therefore supports Windows and Linux VMs alike. Do bear in mind though that one of the main reasons you are likely to be deploying Flash Accel is that you are also running a NetApp filer as your back end storage, and if you are also utilising the filer for snapshot backups and replication of your VMs then Flash Accel is going to be a much safer bet given that it is all about data coherency (see the NetApp Geek blog here for more on that: LINK).

Once I had all the hardware built and SSDs in place, I was ready to deploy the Flash Accel Management Console (FAMC).  This is an OVA template which is downloaded from the NetApp site and deployed onto your vSphere cluster.  It's a Linux VM which basically manages the deployment of agents to the hosts and VMs, as well as assigning cache disks to the VMs and even showing analytics of VM performance.

This is the usual OVA kind of deployment with nothing to worry about.  The only pointer I'd give you is to enter only a server name and not an FQDN in the deployment wizard, otherwise it fails and you have to re-enter all the info again within the console of the VM. It doesn't break anything, but it is annoying.

Once the FAMC was up and running I simply hit the IP address in a browser and logged in with the default username and password of admin / netapp.

Here I set up the appliance to connect to the vCenter server, entered a username and password for the hosts and then a generic local admin account for the guest VMs; this is used to deploy the OS agent to VMs being cache enabled.  Again, all straightforward and simple stuff.

Next I needed to allow the FAMC to perform an inventory of the environment (this took a while as I have a large estate connected to the vCenter) and once complete I could see it list all of the hosts and VMs.

Now in my deployment I had created a dedicated cluster which would be the only one with SSD drives for caching (at present anyway), so I only had to deploy the host agent to the 6 hosts in this cluster.
Again, this is pretty straightforward.  Place the host into maintenance mode, then select the host and push the agent from the FAMC (you upload the latest host and VM agents as part of the setup of the FAMC, by the way).  The host is rebooted once the agent is installed, and then you assign the local SSD on the host as a caching disk for Flash Accel and finally enable the disk.
Quick hint:  Don't forget to take the host out of maintenance mode once the agent has been installed, otherwise you will not be able to assign the SSD disk or enable it.
Perhaps future releases of this console will automatically place the host in and out of maintenance mode but for now you will just get an error message if you forget to do this yourself.
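If you manage vSphere from PowerCLI, flipping the host in and out of maintenance mode either side of the agent push is a one-liner each way. A rough sketch (the host name is just an example, and the host will need its VMs evacuated or powered off before it will enter maintenance mode):

  Get-VMHost esx01.domain.local | Set-VMHost -State Maintenance
  # ...push the host agent from the FAMC and let the host reboot...
  Get-VMHost esx01.domain.local | Set-VMHost -State Connected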

With all of the hosts installed the next step was to install OS agents onto VMs that were to be enabled for caching.
This is something that needs to be given some thought before going ahead and just enabling caching on all Windows VMs in the cluster.  This is a read IO caching solution, which means it only services read IO of the VM from the local SSD cache.  VM writes are still written back to your NetApp controller, ensuring that your data is central and secure as usual, but they will not be (directly) accelerated.

To identify VMs which are going to be good candidates for Flash Accel I used a SolarWinds product called Virtualisation Manager.  This shows VM IOPS and breaks down read and write IO values easily so you can see which VMs are going to benefit from the cache.  It can also break this down further into which vdisk is generating the IO, so you can tune the cache better and only cache the drives which are generating the read IO.  Other solutions that can do this would be vCenter Operations Manager (vCOPS) or NetApp DFM (now called Operations Manager I believe).

Once the VMs were identified, the biggest culprits were the usual suspects: Domain Controllers, SQL servers and web content servers were all producing higher than average read IO across the environment.

The OS agent installation is yet another easy process, but it does require at least one reboot to complete, so scheduled downtime is needed to roll this out in production environments.
Quick hint: You need to have UAC turned off for the installation to work on both 2008R2 and 2012/2012R2.  If UAC is enabled then turn it off by running msconfig and using the UAC tool, then reboot the VM to apply the change.
NB: Windows 2012 and 2012R2 also need a registry change made to actually turn UAC off. See the link here for more on that: LINK
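For reference, I believe the registry change in question is the EnableLUA value (set to 0, followed by a reboot).  Something along these lines from an elevated command prompt should do it, but do check the linked article against your own environment first:

  reg add "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" /v EnableLUA /t REG_DWORD /d 0 /f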

With UAC off, the agent installed and the VM rebooted, I could then allocate the cache on the VMs. I was careful to only set a size that fit the dataset of the VM being enabled.  It's not always necessary to allocate a whole disk's worth of cache to a VM to get a benefit.  For me, I allocated 10GB of cache to our DCs as each of these VMs has around 60GB of disk allocated and approximately 30-40GB in use.  As you can see from the screenshots of the analytics below, this yielded an excellent cache hit rate and meant that I had more cache left over to allocate to other, larger VMs too.



The above screenshot shows a DC which I knew was going to give good cache hit rates, given the nature of what a DC does.

The screenshot below shows a 6 hour period for another DC after enabling the cache:


You can see that as the cache population rises, the hit rate goes up, meaning that more of the read IO is being served from the cache instead of the NetApp controller.

It's early days for the deployment but already I have seen a reduction in read IO on the controllers.  So far so good and once this cluster is populated fully I plan to back-fit our other clusters with cache and do the same again with VMs on them too.

I hope to share some more experiences of this solution soon and will give updates on the success of other systems as they are enabled.

Thanks for sticking around until the end of the post :-)


Thursday 5 December 2013

Getting around vCenter 512bit certificates

Can't get the vCenter Web Client to work on your new Windows 8.0/8.1 machine because the certificate is not trusted?

The issue arises when you originally deployed your vCenter server at version 4.0, where by default the self-signed certificate generated by vCenter was 512 bits. If you have never replaced your certificate then it will still be 512 bits. If you have upgraded your vCenter to 4.1, 5.0, 5.1 or even 5.5 then, unless you have replaced the certificate along the way, it will still be the same 512bit certificate it was when you first started out.
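If you're not sure whether your vCenter is still using the original 512bit certificate, a quick way to check from any machine with OpenSSL available is something like the below (swap in your own vCenter hostname) and look for the key size in the output:

  echo | openssl s_client -connect vcenter.domain.local:443 2>/dev/null | openssl x509 -noout -text | grep -i "Public.Key"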

Nothing very wrong with this setup, until MS released KB2661254, which changed the default minimum accepted key length from 512 bits to 1024 bits to nudge people up the security ladder a little. This resulted in vCenter certificates no longer being trusted by clients and hence blocked access to the vCenter web portal.

Now the correct way to deal with this is to generate a new vCenter certificate which has at least a 1024bit key (preferably 2048 bits). This will not only allow the updated clients to function again, but will give you that warm fuzzy feeling that only running your environment at a higher security level can achieve. This is, however, easier said than done.
The process of replacing certificates within vCenter is not straightforward. VMware have significantly improved the process by way of the SSL Automation Tool for vCenter 5.0 and above but this is still a fairly lengthy process which is fraught with possible danger (of breaking your vCenter deployment). This needs to be planned and tested and adequate backup and recovery processes put in place before you proceed with doing this on a mature production environment.

A short term fix to get around this is to once again trust the 512bit key and proceed as you were.
The below command can be run from a command prompt on a Windows client to revert the KB's effects:

certutil -setreg chain\minRSAPubKeyBitLength 512

Obviously, doing this will also result in the client trusting ALL 512bit keys that are out there, so this should only be viewed as a short term fix whilst you plan the certificate upgrades for vCenter as recommended by VMware.

Once you have resolved the certificate issues and are now sporting a shiny new 1024bit (or 2048bit) certificate, don’t forget to revert the changes above to secure your client(s) again. This can be easily done by removing the registry entry that the above command creates here:

HKEY_LOCAL_MACHINE\Software\Microsoft\Cryptography\OID\EncodingType0\CertDLLCreateCertificateChainEngine\Config\minRSAPubKeyBitLength
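Alternatively, certutil can remove the value for you in the same way it was added; something along these lines (again from an elevated command prompt) should do it:

  certutil -delreg chain\minRSAPubKeyBitLength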



Thursday 28 November 2013

vCloud Director sysprep files

Had some fun running up a vCD server this past week, so thought I'd post a quick memo to advise of the following change between vCD 5.1 and vCD 5.5 regarding sysprep files.

I had been following some excellent blogs on the vCD 5.1 install process from Kendrick Coleman (Install vCD 5.1 & vCD Networking) and applying this to my vCD 5.5 installation.  When I tried to follow the process to copy the sysprep files over to the vCD cell I hit a snag, as there was no script to run to generate the sysprep files required. This, it turns out, is because in 5.5 they have improved this process: now you simply need to create the directories and place the sysprep files into them and away you go.  Not even a service restart is required to start customizing older OSes through vCD.

The folder locations in vCD 5.5 should be (extract taken from the VMware install document for vCD 5.5 - which I should have read more keenly it seems!):

Procedure:

  1. Log in to the target server as root.
  2. Change directory to $VCLOUD_HOME/guestcustomization/default/windows.
    [root@cell1 /]# cd /opt/vmware/vcloud-director/guestcustomization/default/windows
  3. Create a directory named sysprep.
    [root@cell1 /opt/vmware/vcloud-director/guestcustomization/default/windows]# mkdir sysprep
  4. For each guest operating system that requires Sysprep binary files, create a subdirectory of
    $VCLOUD_HOME/guestcustomization/default/windows/sysprep.
    Subdirectory names are specific to a guest operating system and are case sensitive.
    • Windows 2003 (32-bit) should be called svr2003
    • Windows 2003 (64-bit) should be called svr2003-64
    • Windows XP (32-bit) should be called xp
    • Windows XP (64-bit) should be called xp-64
  5. Copy the Sysprep binary files to the appropriate location on each vCloud Director server in the server group.
  6. Ensure that the Sysprep files are readable by the user vcloud.vcloud.
    Use the Linux chown command to do this.
    [root@cell1 /]# chown -R vcloud.vcloud $VCLOUD_HOME/guestcustomization
When the Sysprep files are copied to all members of the server group, you can perform guest customization
on virtual machines in your cloud. You do not need to restart vCloud Director after the Sysprep files are copied.
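In practice, on a default install the whole thing boils down to a handful of commands on the cell. The paths below are the defaults from the procedure above; the /tmp source directory is just an example of wherever you have staged the Sysprep binaries:

  [root@cell1 /]# cd /opt/vmware/vcloud-director/guestcustomization/default/windows
  [root@cell1 windows]# mkdir -p sysprep/svr2003 sysprep/svr2003-64 sysprep/xp sysprep/xp-64
  [root@cell1 windows]# cp /tmp/sysprep-svr2003/* sysprep/svr2003/
  [root@cell1 windows]# chown -R vcloud.vcloud /opt/vmware/vcloud-director/guestcustomization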

So there you go...simple if you read the manuals properly in the first place :)

Thursday 14 November 2013

VMworld 2013 - some thoughts...

I had the very good fortune of attending the VMworld 2013 conference in Barcelona in October [for free too, courtesy of one of our IT suppliers :-)] and so thought I'd post a few thoughts and impressions gathered from the conference whilst I still remember them fresh(ish).

I had previously been to one other VMworld, Cannes in 2009, and had been very impressed with the conference and the general quality of the break-out sessions, so I was looking forward to this conference immensely, especially given some of the new technologies revealed during the US event a couple of months prior, such as vSphere 5.5, vFRC and the awesome looking VSAN.

The venue, having now moved to Barcelona, was new but the quality of the event was still top notch!
The break-out sessions are the real reason to go to these conferences and they did not disappoint one bit. Close to the start of the event it seemed that many of the sessions I wanted to attend were fully booked up. At first I was annoyed with this but soon realised that just going to the session and waiting outside before it started pretty much guaranteed you a place in the room anyway (although probably at the back) and I ended up not missing a single session all week.
My favourite sessions were on VSAN, flash caching and some of the new cloud automation suites that VMware are now doing. Flash, btw, was everywhere at this event.  If you were in any doubt about how things are progressing with flash technology, you were left in no doubt at this event that flash is going to be EVERYWHERE pretty soon (if it's not made it into your datacentre already).

VMworld had released a mobile app for your smartphone where you could register for sessions and plan your day's activities, and this was really useful to have, especially when trying to navigate around the enormous conference suite. They had provided maps, social feeds and even an interactive game in the app. This was a really good improvement from the last VMworld I'd been to, and even though there were large screens displaying all of the session info almost everywhere you looked, it was so handy to have when you were sitting in a quiet spot in the 'hang-space' trying to plan where to go later that day.

I remember being impressed by the Labs at the 2009 conference, and again I really liked the accessibility and the ease with which you can get first-hand experience of so much of the new tech coming out from VMware.  This was a popular part of the conference, especially on the first day, but later in the event it was fairly easy to get a desk and get onto any lab that you wanted.
They had even provided BYOD lab areas where you would use your own laptop to connect to the lab environment which I thought was a great idea (except that I'd only brought my old Android tablet out with me which wasn't really up for the challenge).

The solutions exchange was where all of the vendors pitched up to show off their wares, and it had all of the usual suspects that you would expect.  One very noticeable exception though was Symantec.  I had hoped they would be attending (like they had in 2009) as we use Symantec backup products and I had a few things I wanted to discuss around vSphere backups and virtual machine AV protection.  From what I gathered this was probably a political withdrawal due to some support issues with their backup products being a little late to the vSphere 5.1 support party (by nearly a year), and they probably didn't want to be on the end of too much public bashing from the people who really felt these issues and would likely be there.
Having said that, I read recently that Symantec are offering support for vSphere 5.5 and future releases within 90 days of GA.  This is a great response to the problem and if they keep it up, they will surely keep vSphere backup customers and gain new ones too! 90 days is a very acceptable time frame by which you would start to think about deploying an upgrade to the GA of a new mission critical infrastructure platform such as vSphere.

Some of the solutions exchange highlights I saw this year were these (in no particular order):

  • Tintri - VM aware storage promising great performance at a price point that makes a lot of sense to seriously question your next SAN upgrade.
  • NetApp Flash Accel integration with VSC 5.0 - This is something which I am currently looking to deploy into production, and probably the subject of my next blog post too!  A great product (which is free to existing NetApp customers) and now fully integrated into the vSphere web client.  Looked very slick and adds to the already excellent VSC product too.
  • FlashSoft - Flash caching for physical and virtual environments.  Reasonably priced and, even though the vendor is SanDisk, it works with any SSD or PCIe flash device too
  • Infinio - VM caching solution which uses ESXi host RAM instead of SSD devices.  Very nice concept and again another sweet price point too (albeit with the requirement to have significant memory free in each ESXi host which is not that typical in my experience)
There were many great products and demos and I've certainly missed out loads of good ones.  These were just some that I was particularly impressed with and liked what they were doing. 
As I said earlier, flash and storage caching solutions were everywhere in the solutions exchange and this is a space where there will be a huge change to how we are mostly all doing our virtual deployments at present.  It's getting cheaper and the solutions are getting smarter too.  Always a good combination!






Friday 26 July 2013

Failed to open (The parent virtual disk has been modified since the child was created)

Error:
  • Failed to open (The parent virtual disk has been modified since the child was created).
This error came up the other day on a couple of our virtual machines when we tried to power them on after they died over a weekend.
This issue is in fact covered extremely well by the following KB article here,
and I would highly recommend that you read through the article and get to grips with how the various files which make up the virtual server and its disks, snapshots etc. fit together, as it will help no end when trying to fix this or similar issues.


Now it turns out that this issue was being caused by our backup software trying to take a weekly tape copy of some virtual machines whilst, at the same time, a NetApp SnapManager for Virtual Infrastructure (SMVI) backup and replication job was trying to run.
The two snapshot commands seem to have overlapped and whilst one was being deleted the other was trying to create a new snapshot and so the disk descriptor files were pointing to different snapshot delta files and referencing the wrong parent ID (This all makes more sense when you read the KB article, trust me!).
I'm not too sure why this is allowed to occur, but it has now happened around 5 times in our environment over weekends to different VMs, and as such we have had to be more selective about when we schedule the tape backups to avoid the regular NetApp snapshots. (We only do both because we do not hold long disk retention policies offsite and so require tape backups to supplement our disk backup strategy for long term retention...a pain, but just the way it is at present.)

To fix this issue the article recommends connecting to the host and manually opening, reading and possibly editing these files using vi, but that is not too easy when you are trying to compare multiple files and cross-reference CIDs and parent CIDs across potentially 3, 4, 5 or more disk descriptor files, depending on the number of snapshots and disks the VM has.

My approach is to follow the steps below and use free 3rd party tools to make things easier on yourself.

Process:

  1. Enable SSH on the ESXi host and open the host's firewall port for the SSH server if not already allowed (do this through vCenter for ease!)
  2. Connect to the ESXi host using WinSCP – This is much easier than going through the command line or vMA service as detailed in the KB
  3. Copy the following files to your local machine to identify the issue:
    1. vmware.log (in the VM's folder) – use this to identify which disk and which snapshot file is reporting the issue
    2. Virtualserver.vmx – use this to identify which snapshots are currently identified as in use
    3. Virtualserver.vmdk – this is the base disk descriptor file containing the first parent CID
    4. Virtualserver-00001.vmdk – this will be the first snapshot delta disk descriptor file and should have the base disk's CID as its parent (there may be more than one snapshot file per disk, such as 00002.vmdk and/or 00003.vmdk etc., which should each reference the preceding snapshot as their parent until they eventually lead back to the base disk's CID)
  4. Use Notepad++ or similar to view all of the files (this utility is excellent for formatting these files into a more readable state and also maintains the file's formatting when modifying, which you are likely to have to do!)
  5. Make a copy of the files unedited on your machine in case the resolution doesn't work (IMPORTANT!!!)
  6. Make the required changes to the disk descriptor files or the vmx file in order to resolve the issue, using the information in the KB article. For reference, if the snapshot delta file does not contain any data (16MB or less, for example) then it may be best to just edit it out of the vmx file and point to an earlier snapshot or the base disk itself in order to bring the VM back online again.
  7. Copy the edited file(s) back to the original location and overwrite as needed using WinSCP
  8. Power on the VM and cross those fingers! :)
  9. If all is good then be sure to delete any unused snapshot descriptor, delta and check point files from the virtual servers directory so as not to affect any future snapshots and to keep things clean.
This is a good and fairly straightforward resolution to the issue. Key to getting this right, though, is understanding how the descriptor files work and mapping out (often on a piece of paper if need be) the relationship between each base disk and its snapshot(s) before making any changes.  As mentioned, keep a copy of these files, as you may be able to revert any changes made in error just by replacing them.  Ideally though, if you are not certain, always ensure that you have a full backup of all of the files (especially the flat files) before making any changes, as per best practice!
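To give you an idea of what you are mapping out, the important lines in each descriptor file are the CID, parentCID and parentFileNameHint entries. Stripped right down (and with made-up example values), a healthy chain looks something like this:

  Virtualserver.vmdk (base disk):
    CID=fb183c20
    parentCID=ffffffff

  Virtualserver-00001.vmdk (first snapshot):
    CID=a1b2c3d4
    parentCID=fb183c20
    parentFileNameHint="Virtualserver.vmdk"

If a snapshot's parentCID no longer matches the CID of the file it points at, you get exactly the error above.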

Good luck.