Tuesday 4 February 2014

A NetApp Flash Accel 1.3 deployment

***Disclaimer*** This is a bit of a long post.  I wanted to get it all down in one go, so apologies if you are still reading this in another 30 mins time :-)

So I have just finished a deployment of the newly released NetApp Flash Accel 1.3 for VMware (as of 30th Jan 2014) and I have to say I am impressed.
Not only was the entire install process straightforward and clean, but the performance gains were visible within the first hour of VMs being migrated - always a benefit when you have management looking over your shoulder as you deploy a new solution!

For this deployment I was installing 6 new HP BL460c Gen8 blades, each with 2 x Intel DC S3700 400GB SSD drives in a RAID 0 configuration (I am using the MicroSD card for the ESXi OS as these blades only have 2 x 2.5" disk slots).

There are a couple of pre-requisites for this solution: you must be running vSphere 5.1 (although you do not need an Enterprise Plus licence, as any 64-bit 5.1 edition is supported) and you need to allow for the additional host memory requirement, which depends on how much SSD you have configured for Flash Accel in each host.  The second point is almost moot for all but the most heavily used vSphere deployments, as the requirement for version 1.3 has dropped from 0.006GB to 0.0035GB of RAM per 1GB of SSD. As described in the manual, you therefore only need 3.5GB of additional host RAM for 1TB of SSD...not a lot really considering most modern ESXi hosts are likely to sport 128GB, 192GB or even 256GB+ of RAM these days.
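To put that overhead into numbers, here is a trivial Python sketch of the sum; the per-GB figure is the one quoted above, and the SSD sizes are just this deployment's values used as examples.

```python
# Rough sketch of the Flash Accel 1.3 host RAM overhead described above.
# The 0.0035 GB-per-GB figure is the v1.3 requirement quoted in the post;
# the SSD sizes below are just example values for illustration.

RAM_PER_GB_SSD = 0.0035  # GB of host RAM per GB of cache SSD (Flash Accel 1.3)

def flash_accel_ram_overhead(ssd_gb: float) -> float:
    """Return the additional host RAM (GB) needed for a given cache size."""
    return ssd_gb * RAM_PER_GB_SSD

if __name__ == "__main__":
    # 2 x 400GB SSDs in RAID 0 per host, as in this deployment
    cache_gb = 2 * 400
    print(f"{cache_gb} GB of cache needs ~{flash_accel_ram_overhead(cache_gb):.2f} GB extra RAM")
    # 1TB of SSD, the example from the manual
    print(f"1000 GB of cache needs ~{flash_accel_ram_overhead(1000):.2f} GB extra RAM")
```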

Another consideration for deploying Flash Accel is that it currently only supports Windows 2008 R2 and higher VMs.  This is because an OS agent needs to be installed on each VM that is being enabled for Flash Accel, and currently NetApp have only written a Windows driver, although they do state on their webpage that a Linux driver will follow at some point.
Now this is going to be a pain point for some people as Linux VMs are popular (and in many cases, getting more popular) and even our deployment would have benefited further from being able to accelerate Linux VMs (of which we have many) as well.

NB: If you are running Enterprise Plus licences and have upgraded to vSphere 5.5 then you also have vFlash as an option.  That solution from VMware is kernel based, so it does not require any guest drivers and therefore supports Windows and Linux VMs alike. Do bear in mind, though, that one of the main reasons you are likely to be deploying Flash Accel is that you are running a NetApp filer as your back-end storage, and if you are also using the filer for snapshot backups and replication of your VMs then Flash Accel is going to be a much safer bet, given its focus on data coherency (see the NetApp Geek blog here for more on that: LINK)

Once I had all the hardware built and the SSDs in place, I was ready to deploy the Flash Accel Management Console (FAMC).  This is an OVA template which is downloaded from the NetApp site and deployed onto your vSphere cluster.  It's a Linux VM which manages the deployment of agents to the hosts and VMs, assigns cache disks to the VMs and even shows analytics of VM performance.

This is the usual OVA deployment with nothing to worry about.  The only pointer I'd give you is to enter just a server name and not an FQDN in the deployment wizard, otherwise it fails and you have to re-enter all the info again within the console of the VM. It doesn't break anything, but it was annoying.

Once the FAMC was up and running I simply hit the IP address in a browser and logged in with the default username and password of admin / netapp.

Here I set up the appliance to connect to the vCenter server, entered a username and password for the hosts and then a generic local admin account for the guest VMs; this is used to deploy the OS agent to VMs being cache enabled.  Again, all straightforward and simple stuff.

Next I needed to allow the FAMC to perform an inventory of the environment (this took a while as I have a large estate connected to the vCenter) and once complete I could see it list all of the hosts and VMs.

Now in my deployment I had created a dedicated cluster which would be the only one with SSD drives for caching (at present anyway), so I only had to deploy the host agent to the 6 hosts in this cluster.
Again, this is pretty straightforward.  Place the host into maintenance mode, then select the host and push the agent from the FAMC (you upload the latest host and VM agents as part of the FAMC setup, by the way).  The host is rebooted once the agent is installed, and then I assigned the local SSD on the host as a caching disk for Flash Accel and finally enabled the disk.
Quick hint:  Don't forget to take the host out of maintenance mode once the agent has been installed, otherwise you will not be able to assign the SSD disk or enable it.
Perhaps future releases of this console will automatically place the host in and out of maintenance mode but for now you will just get an error message if you forget to do this yourself.
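For anyone who wants a quick sanity check before assigning cache disks, here is a minimal pyVmomi sketch that looks for hosts in the caching cluster that are still in maintenance mode and takes them out. The vCenter address, credentials and cluster name are assumptions for illustration only; the FAMC itself has no scripting interface that I'm aware of, so this is purely against vCenter.

```python
# Minimal pyVmomi sketch: find hosts in the caching cluster that are still in
# maintenance mode and exit them, so the FAMC can assign/enable cache disks.
# VCENTER, credentials and CLUSTER are hypothetical values - change to suit.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

VCENTER = "vcenter.example.local"   # hypothetical
CLUSTER = "FlashAccel-Cluster"      # hypothetical

ctx = ssl._create_unverified_context()  # lab convenience only - skips cert checks
si = SmartConnect(host=VCENTER, user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        if cluster.name != CLUSTER:
            continue
        for host in cluster.host:
            if host.runtime.inMaintenanceMode:
                print(f"{host.name} is still in maintenance mode - exiting it")
                # Fire-and-forget here; a real script would wait on the task
                host.ExitMaintenanceMode_Task(timeout=0)
            else:
                print(f"{host.name} is ready for cache assignment")
    view.Destroy()
finally:
    Disconnect(si)
```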

With all of the hosts installed the next step was to install OS agents onto VMs that were to be enabled for caching.
This needs some thought before you go ahead and just enable caching on every Windows VM in the cluster.  Flash Accel is a read-IO caching solution, which means it only serves the VM's read IO from the local SSD cache.  VM writes still go back to your NetApp controller, ensuring that your data stays central and secure as usual, but they will not be (directly) accelerated.

To identify VMs which are going to be good candidates for Flash Accel I used a SolarWinds product called Virtualisation Manager.  This shows VM IOPS and breaks down read and write IO values easily, so you can see which VMs are going to benefit from the cache.  It can also break this down further into which vdisk is generating the IO, so you can tune the cache better and only cache the drives which are generating the read IO.  Other solutions that can do this would be vCenter Operations Manager (vCOps) or NetApp DFM (now called Operations Manager I believe).
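Whichever tool you use, the sifting logic is the same: find the VMs doing plenty of IO where reads dominate the mix. Here is a small Python sketch that ranks candidates from an exported per-VM IOPS report; the CSV column names, the file name and the thresholds are all my own illustrative assumptions, as the export format will vary by tool.

```python
# Rank read-cache candidates from a per-VM IOPS export.
# Assumed CSV columns: vm_name, read_iops, write_iops (hypothetical format).
# Read-heavy, busy VMs are good Flash Accel candidates; write-heavy VMs gain
# little from a read cache.
import csv

def rank_candidates(path: str, min_read_iops: float = 100.0,
                    min_read_ratio: float = 0.6):
    candidates = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            reads = float(row["read_iops"])
            writes = float(row["write_iops"])
            total = reads + writes
            if total == 0:
                continue
            read_ratio = reads / total
            if reads >= min_read_iops and read_ratio >= min_read_ratio:
                candidates.append((row["vm_name"], reads, read_ratio))
    # Busiest read-heavy VMs first
    return sorted(candidates, key=lambda c: c[1], reverse=True)

if __name__ == "__main__":
    for name, reads, ratio in rank_candidates("vm_iops_export.csv"):
        print(f"{name}: {reads:.0f} read IOPS ({ratio:.0%} of total IO)")
```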

Once the VMs were identified, the biggest culprits were the usual suspects: Domain Controllers, SQL servers and web content servers were all producing higher-than-average read IO across the environment.

The OS agent installation process is yet another easy process, but it does require at least one reboot to complete, so scheduled downtime is needed to roll this out in production environments.
Quick hint: You need to have UAC turned off for the installation to work on both 2008R2 and 2012/2012R2.  If UAC is enabled, turn it off by running msconfig and using the UAC tool, then reboot the VM to apply the change.
NB: Windows 2012 and 2012R2 also need a registry change made to actually turn UAC off. See the link here for more on that: LINK
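The registry change in question is generally the EnableLUA DWORD under the Policies\System key; here is a small sketch of it in Python, on the assumption that that is the value the linked article covers and that Python is available on the guest (reg.exe or Group Policy will do the same job). Run it from an elevated prompt and reboot afterwards.

```python
# Sketch: fully disable UAC on Windows Server 2012/2012R2 by setting
# EnableLUA to 0 (the UI slider alone is not enough on 2012).
# Assumes Python is installed on the guest; run elevated, then reboot.
import winreg

KEY_PATH = r"SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "EnableLUA", 0, winreg.REG_DWORD, 0)

print("EnableLUA set to 0 - reboot the VM for the change to take effect")
```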

With UAC off, the agent installed and the VM rebooted, I could then allocate cache to the VMs. I was careful to only set a size that fit the dataset of the VM being enabled; you don't always need to allocate a whole disk's worth of cache to a VM to see a benefit.  I allocated 10GB of cache to each of our DCs, as these VMs have around 60GB of disk allocated and approximately 30-40GB in use.  As you can see from the analytics screenshots below, this yielded an excellent cache hit rate and meant that I had more cache left over to allocate to other, larger VMs too.
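As an aside, here is the rough rule of thumb I was working to, expressed as a trivial Python sketch. The working-set fraction is my own illustrative assumption rather than a NetApp figure, and the VM names and sizes are made up; the DC example from this post (30-40GB in use, 10GB cache) happens to sit in that range.

```python
# Rule-of-thumb cache sizing: size against the data actually in use, not the
# provisioned vdisk. The 30% working-set fraction is an illustrative
# assumption, not a NetApp recommendation; VM names/sizes are hypothetical.
def suggest_cache_gb(in_use_gb: float, working_set_fraction: float = 0.3) -> int:
    return round(in_use_gb * working_set_fraction)

for vm, in_use in [("DC01", 35), ("SQL01", 200), ("WEB01", 80)]:
    print(f"{vm}: ~{suggest_cache_gb(in_use)} GB cache for {in_use} GB in use")
```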



The above screenshot shows a DC which I knew would give good cache hit rates, given the nature of what a DC does.

The screenshot below shows a 6 hour period of another DC from enabling the cache:


You can see that as the cache population rises, the hit rate goes up, meaning that more of the read IO is being served from the cache instead of the NetApp controller.

It's early days for the deployment, but I have already seen a reduction in read IO on the controllers.  So far so good, and once this cluster is fully populated I plan to back-fit our other clusters with cache and do the same again with the VMs on them too.

I hope to share some more experiences of this solution soon and will give updates on the success of other systems as they are enabled.

Thanks for sticking around until the end of the post :-)