Some quick thoughts about backup

November 11, 2010 1 comment

This is a summary I wrote for someone else, not my usual blog entry, however it does encapsulate my thoughts around the benefits of NetApp’s implementation of replication based backup. I’ll try to get to a more technically focussed version soon.

Replication Based Backups

Exponential increases in data, combined with increased storage density techniques, mean that traditional “bulk copy” methods of backup are no longer able to address the growing backup challenges of a modern IT environment. Even backup architectures based on an “incremental forever” approach may find that the time it takes to move whole files from remote locations over slow WAN links hits scalability limits as the amount of data at the remote sites increases.

Ultimately, the most promising technology to resolve this involves replicating only changed data blocks from primary data sources to secondary arrays in other physical locations. These secondary arrays are equipped with high-density, low cost disk drives to provide large amounts of raw capacity in the densest possible footprint. Once the data has been sent to the remote array, the solution can then perform various kinds of data manipulation to store multiple recovery points within a small data storage footprint. This class of technology generally requires that the primary storage is re-hosted on intelligent storage arrays, or that new agents, secondary storage arrays and backup systems are implemented that support this advanced functionality.

This has the advantage that only changed blocks are moved from primary storage through to the secondary storage on the NearStore. This both reduces the amount of storage that needs to be provisioned and allows the data to be sent over low bandwidth, high latency network connections. Because of this, the secondary copies of the data can be stored automatically in an offsite location without needing the separate second step that is commonly required with tape backups.

With robust data storage architectures, multiple logical data points, and offsite copies, these technologies can provide almost all the benefits of a tape based backup solution.
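As a rough illustration of the changed-block idea described above, here is a minimal Python sketch. It is not how SnapVault or any other NetApp product works internally; real arrays typically find changed blocks from snapshot metadata rather than by re-hashing data. The 4 KiB block size and the function names are assumptions invented for this example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes, chosen only for illustration


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return {block_index: block_data} for blocks that differ between old and new."""
    old_digests = [hashlib.sha256(b).digest() for b in split_blocks(old, block_size)]
    delta = {}
    for index, block in enumerate(split_blocks(new, block_size)):
        is_new_block = index >= len(old_digests)
        if is_new_block or hashlib.sha256(block).digest() != old_digests[index]:
            delta[index] = block
    return delta


def apply_delta(replica: bytes, delta: dict[int, bytes], new_length: int,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Apply changed blocks to the secondary copy, then trim it to the new length."""
    blocks = split_blocks(replica, block_size)
    for index in sorted(delta):
        if index < len(blocks):
            blocks[index] = delta[index]   # overwrite a block that changed
        else:
            blocks.append(delta[index])    # block beyond the old end of the data
    return b"".join(blocks)[:new_length]


if __name__ == "__main__":
    primary_old = bytes(BLOCK_SIZE * 4)
    primary_new = bytearray(primary_old)
    primary_new[5000] = 1                  # touch one byte inside block 1
    delta = changed_blocks(primary_old, bytes(primary_new))
    print(f"blocks to send over the WAN: {sorted(delta)} of 4")
```

Only the blocks in `delta` need to cross the link, which is why this style of backup copes with slow WAN connections better than moving whole files.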

Customer Backup Considerations

Proven Scalability

While a number of companies have been changing their traditional backup engines to leverage the benefits of replication based storage, NetApp pioneered this technology with the release of the industry’s first ATA based backup to disk appliance in 2002. Since then NetApp has deployed this technology in thousands of locations world-wide, many of which protect critical data estates measured in Petabytes.

Centralised Policy Based Protection

The advanced data protection capabilities provided by NetApp require a correspondingly advanced set of management methodologies and tools to fully exploit the benefits of replication based data protection. Protection Manager was designed for replication based backup, and provides an integrated way of managing both backup and disaster recovery in a single pane of glass through the following functions and features:

  • Discovery.  Detects new volumes not protected and presents as “unprotected data” in the Protection Manager UI.
  • Policy creation. Creates policies for data protection in a wizard-driven graphical process and then calls lower level NetApp tools for execution of the replication process
  • Monitoring. Monitors the whole replication process, watching the capacity and performance against policy, and ensures that protection policies are not out of compliance
  • Visualization.  Provides discovery and mapping views including drilldown and management by exception
  • Reporting. Offers status and health reporting such as a “data transfer report” to identify transfer amount, performance metrics, and duration of transfer for replication processes
  • Virtual machine support. Support through Open Systems SnapVault includes VMware ESX, Microsoft Hyper-V, and Citrix XEN
  • Application integration.  Integrates with SharePoint, SQL Server, Microsoft Exchange, Oracle, and SAP via NetApp SnapManager
  • DR task automation.  Automates tasks, leverages  templates, and provides ongoing monitoring with subsequent reporting to those in authority
  • DR readiness. Monitors resources for changes that could compromise a disaster recovery and proactively communicates them to administrators for remediation
  • One-button failover. Provides continued data access to users, even in the event of a disaster

Usable Copies

Unlike typical backup applications, SnapVault always keeps the data in its original usable format, which can be accessed by open industry standard protocols and methods. Files can be accessed using CIFS, NFS or HTTP, and LUNs can be accessed by iSCSI or Fibre Channel, all without having to restore the data back to the original location (which may destroy good data), or find alternate space to recover the file / data object.

Ease of Restore

Usable copies also provide a self service restore capability that reduces recovery times (RTO), decreases helpdesk calls, and increases end user faith in the backup process. This in turn reduces counterproductive end user driven backup strategies and reduces both infrastructure and business costs. Usable copies also allow backups to be verified for correctness, and provide easy ways of performing deep content searching of backups for legal and other data discovery requests.

Tape Integration

Because the backup data remains in the same format used for traditional primary storage, the high speed NDMP based dump and mirror to tape options used by thousands of companies around the world to protect their NetApp primary storage can also be used to protect it. These long term archival copies can be sent to tape under the control of traditional backup systems such as NetBackup, TSM or CommVault, leveraging existing knowhow and infrastructure while minimising the costs associated with tape management and off-siting.

 

 

Categories: Uncategorized

Data Storage for ECM – Part 1 – Unified Storage

September 15, 2010 4 comments

The last few months have been interesting for me, as my new job role involves a lot of work with alliance partners, many of whom either didn’t know anything about NetApp, or where they did know something it was along the lines of “Oh yeah, the NAS company”. In many respects, it’s a lot easier to explain what we do when someone has an open mind, as pre-conceived notions are often hard to budge, and telling someone they’re wrong is rarely a good way to start a trusted relationship. Even though I report up through our local director of marketing, my soul is still that of an engineer, so when it comes to describing what NetApp does, and why that’s important, I tend to go straight to “Well, we still sell NAS, and that’s a big part of what we do, but what we really sell is Unified Storage”, at which point I expect to see the “and I should care about this because …” look.

I’ve been seeing this look quite a bit recently, mostly because many of the people I speak to also get briefs from other storage vendors, and they too have suddenly started talking about “Unified Storage” without really understanding it or explaining its relevance to datacenter transformation. A good case in point was the opportunity I had to speak at the local VMware seminar series where I shared a stage with VMware, Cisco, and EMC. All of us got our 7.5 minutes to explain how we helped accelerate our customers’ journey to the cloud. VMware went first, followed by EMC, then Cisco and then me.

I’d prepared two slides for my 7 minutes focussing on our key differentiators, Unified storage, tight VMware integration with advanced storage features, deduplication and storage efficiency, Secure Multi-Tenancy, Cisco validated designs, Backup and recovery, and waited happily to see if EMC would come out with their usual pitch.

Boy, was I surprised … EMC’s pitch was Unified storage, deduplication, tight VMware integration with advanced storage features, deduplication and  storage efficiency, security and Cisco validated designs … what the ????, had I suddenly slipped into a parallel universe ? Had EMC, a company fairly well known for pushing seven different kinds of storage with forklift upgrades  suddenly capitulated and acknowledged that the approach NetApp had been pushing for so many years was actually right ? Was Chuck Hollis about to come on stage and apologise for blatant manipulation of social media and comment filtering ?

Now while I could have picked holes in their story by pointing out that at least from a VMware perspective they don’t have deduplication, that their advanced integration with VAAI hadn’t been released, there was no Cisco Validated Design for vBlock, and that the RSA stuff had no integration at the storage layer, nobody is really interested in hearing vendors denigrate each other, and I only had 7 minutes to figure out how to show our unique ability to help customers in the face of the most shameless “me too” campaign I’ve ever seen. During those 7 minutes there was one thing that really struck me. EMC has no real concept of why unified storage is important. Their concept of unified storage was something that allowed connection by Fibre Channel, iSCSI, CIFS and NFS and had a nice GUI. Having worked at NetApp for a number of years, I was surprised at how they’d missed the point completely. Almost everyone at NetApp knows that these are good features to have (we’ve had them for over 10 years now), but we also know that by themselves, they have only limited benefits. I’ve had a little while now to think about this, and it’s become clear to me that for other vendors, Unified storage is not a strategic direction, but a tactical response to NetApp’s continued success in gaining market share. This becomes even more obvious when you take a look at their storage portfolios:

Category | NetApp | EMC | Dell | HP | IBM | HDS
Entry Level NAS / NAS Gateway | FAS | Iomega | Windows Storage Server | Windows Storage Server | N-Series (OEM) | Windows Storage Server
Entry Level SAN / iSCSI | FAS | Celerra NX | Equallogic | MSA, Lefthand | N-Series (OEM), DS3xxx (OEM) | AMS
MidRange NAS | FAS | Celerra NS | Celerra (OEM) | Polyserve ? | N-Series (OEM) | BlueArc (OEM)
Mid-Range SAN | FAS | CLARiiON | Equallogic, Celerra (OEM) | Lefthand, EVA, 3PAR | DS4xxx/5xxx (OEM), N-Series (OEM), XIV | AMS
Archive & Compliance | FAS | Centera | Centera (OEM) | HP RISS | FAS | HCP
Backup to disk platform | FAS | DataDomain, Avamar | - | HP VTL, HP Sepaton (OEM) | Diligent VTL | Diligent (OEM)
Storage Virtualisation Gateway | FAS (V-Series) | Invista, VPLEX | - | - | SVC, N-Series Gateway (OEM) | USP-V
Object Repository | StorageGRID | Atmos | - | StorageGRID (OEM) | StorageGRID (OEM) | HCP
High End / Scale Out | FAS / FAS (C-Mode) | V-Max | V-Max (OEM) | USP-V (OEM) | DS6xxx/8xxx, XIV | USP-V
Mainframe | N/A | V-Max | V-Max (OEM) | USP-V (OEM) | DS8xxx, XIV | USP-V

Now, if you match one of those arrays against the workload it was designed for, you’ll probably get a pretty good result. In a static, reasonably predictable environment without much change, you could make a reasonable argument that this was the best approach to take. You built a silo around an application or function, and purchased the equipment that matched that function. I’ve seen more than one customer that had every product in a vendor’s portfolio and seemed to be fairly happy, or at least had been until fairly recently.

The problem with these narrow siloed approaches is that each silo creates new inefficiencies and dedicated areas of whitespace in both capacity and performance. For example, there is no way of taking excess capacity allocated to a backup to disk appliance and using it for CIFS home directories, nor is there a way of taking the excess IOPS capability of a temporarily idle disk archive and allocating those IOPS to another application undergoing an unusual workload spike such as a VDI bootstorm.

But for me, the biggest area of waste is that of management. Each of these silos tends to get its own set of administrators and workflows, each of which may or may not work in harmony with the others. Most of us have experienced the bitterness and waste of IT turf wars, and the traditional vendors not only encourage, but depend on and help maintain these functionality silos, as it allows a divide and conquer sales model that benefits the vendor far more than the customer. If there was a book entitled “How to build an inflexible and wasteful IT infrastructure”, I imagine that encouraging and spreading “IT functionality silos” would fill up the first few chapters. Even though there are a bunch of people who have been quite happy with this status quo, and with the business processes, from budgeting and product selection all the way through to procurement and training, that entrench this model, things are changing, and they’re changing a lot faster than I thought they would.

A lot of credit for this change has to go to vendors like Microsoft, Cisco and VMware whose products have blurred the lines of these traditional silos. Virtualisation at both the compute and network layers has driven the kinds of cross functional change CIOs have been crying out for, and in the wake of this, Unified storage finds its natural fit; not because of its support for multiple protocols, but because these environments require the kind of workload agility and management simplicity that only a truly Unified storage offering can fully satisfy.

But it’s not just server and desktop virtualisation and other forms of shared infrastructure where unified storage is a natural fit. Almost any “multi-part” or landscape style application can benefit too, not just because of the flexibility and efficiency, but more importantly because these environments are really hard to protect effectively. A really good example of this is Enterprise Content Management applications such as FileNet and SharePoint.

Typically these applications have

  • Content Servers
  • Business Process Workflow Engines / Servers
  • Database servers
  • Index Servers

In a large installation, there will be many of these servers and multiple databases, indexes and content repositories to cater for scalability and in some cases, the tyranny of distance (latency is forever).

In a “traditional” siloed model this data would be stored on two or possibly three different kinds of storage arrays, each with its own method of backup and replication, most of which depend on some form of “bulk copy” backup method as the primary form of logical data protection. The effect of this is that backing up these ECM systems on traditional storage architectures is almost impossible. While I’ve been talking to customers about this for a few years, recently there seems to have been a big increase in customers seeing these problems. In one case a design review for a Petabyte scale SharePoint implementation identified that if a critical index was lost the entire infrastructure would be effectively unrecoverable, and that there was no effective backup capability. In another discussion I had today around redesigning data protection, a brief mention of ECM created more interest than almost anything else, simply because of the difficulty of backing up their Documentum system.

Truly unified storage not only allows data to be stored using multiple protocols, and provides rich functionality like deduplication and compliant WORM storage which makes it a logical choice for  ECM solutions, but more importantly it also provides a single integrated method for protecting that data in a way that is application-consistent without the need for a “cold” backup. And, you guessed it, NetApp can do that with ease, whereas other vendors’ versions of what they are calling unified storage would find that challenging (to be kind).

In the next few posts, I’ll take a deeper dive into exactly what NetApp does for Enterprise Content Management, with a focus on why Unified Storage is such a good match, and what we do to protect a company’s most important data assets.

Categories: Archive, Data Protection, Value

There but for the grace of god ..

September 4, 2010 1 comment

There are few incidents that can truly be called disasters: things like the World Trade Center bombing, the Boxing Day tsunami, and the Ash Wednesday bushfires. Whether or not you think a major failure in IT infrastructure can be called a disaster just like those true tragedies, the recent and very public failure of the EMC storage infrastructure in the state of Virginia is the kind of event none of us should wish on anyone.

While we all like to see the mighty taken down a peg or two, there’s a little too much schadenfreude about this incident from the storage community for my taste. Most of us have had at least one incident in our careers that we are very glad never got the coverage this has, and my heart goes out to everyone involved … “There, but for the grace of God go I …”

I must say though, I’m a little surprised at what appears to be a finger pointing exercise with a focus on operator error, even though it would confirm my belief that Mean Time Before Cock-up (MTBC) is a more important metric than Mean Time Between Failures (MTBF). Based on the nature of the outage and some subsequent reports, it looks like there was a failure of not just one but two memory cache boards in the DMX3. If so, I’d have to agree with statements I’ve seen in the press saying that this is incredibly unlikely or even unheard of. In a FAS array this would be equivalent to an NVRAM failure followed by a subsequent NVRAM failure during takeover, though even then, at worst, the array would be back up with no loss of data consistency within a few hours. Having said that, the chances of either of these kinds of double failure events happening are almost unimaginable, but certainly, as recently shown, not impossible. How “an operator [using] an out-of-date procedure to execute a routine service operation during a planned outage” could cause that kind of double failure is kind of beyond me, and has changed my opinion on the DMX’s supposedly rock solid architecture.

What I believe happened was not a failure of EMC engineering (which I highly respect), or even a failure of the poor tech who followed the “outdated procedure”, but rather a “failure of imagination”.

In this case the unimaginable happened: a critical component that “never failed” did fail, something which provides a valuable lesson to everyone who builds, operates and funds mission critical IT infrastructure. Regardless of whether the problem was caused by faulty hardware, tired technicians, or stray meteorites, there really is no substitute for defense in depth. As a customer / integrator that means:

  • redundant components
  • redundant copies of data on both local and remote hardware,
  • well rehearsed D/R plans.

It doesn’t matter what the MTBF figures say, you have to design on the assumption that something will fail and then do everything within your time, skill and budget to mitigate that failure. If there are still exposures, then everyone who is at risk from those exposures needs to be aware of them and what risks you’re taking on their behalf. We wouldn’t expect anything less from our doctors, and I don’t see why we shouldn’t hold ourselves to that same high standard.

As vendors it’s our responsibility to make the features like snapshots, mirroring, and rapid recovery affordable, and easy to use, and do everything we can to encourage our customers to implement them effectively. From my perspective NetApp does a good job of this, and that’s one of the reasons I like working there.

As more infrastructure gets moved into external clouds, I think it’s inevitable we’re going to hear a lot more about incidents like this as they become more public in their impact. Practices that were OK in the 1990s no longer work in large publicly hosted infrastructures, where many of the old assumptions about deploying infrastructure don’t hold true.

Hopefully everyone responsible for this kind of multi-tenant infrastructure is reviewing their deployment to make sure they’re not going to be next week’s front page news.

Categories: Unplanned downtime

Data Storage for VDI – Part 10 – Megacaches

Megacaches

More recently a range of products have come to the market that take advantage of the increasing affordability of non volatile memory (particularly SLC Flash), to create caching architectures that change the rules for modular storage (in no particular order)

  • PAM-II / FlashCache
  • Sun 7000 Logzilla and Readzilla
  • FalconStor Flash SAN accelerator (using flash modules from Violin)
  • IBM EasyTier / Something to do with SVC
  • EMC FAST Cache
  • Atlantis Computing vScaler
  • Nimble Storage
  • Lots more to come …

While I’d love to go into the details of each of these and compare the features and benefits of each technology, a lack of time and detailed information makes this really hard to do. Also, as a general principle, I don’t think that it’s wise for an employee of one vendor to make a lot of assertions about another vendor’s technology. I have enough trouble keeping up with what’s happening at NetApp without trying to gain deep subject matter expertise with, for example, HP or EMC’s technology. Having said that, I do think contrasting two different approaches can be useful, so for that reason I’ve decided to deviate from that principle, and will compare as diligently as I’m able NetApp’s FlashCache and EMC FAST Cache.

I’ve included FlashCache for obvious reasons: there is already more than a Petabyte of it out there, I’ve been analyzing it for about a year now, and I have access to the engineering documentation. I chose FAST Cache because being an EMC product means the marketing engine behind it will make it widely known and the engineering will be solid. The market presence and differing approaches of both of these technologies make them a fairly good yardstick against which the other mega-cache technologies will compare themselves.

Part of the reason I took so long to write this post was that I spent a fair time trying to characterise the likely performance benefits of a FAST Cache solution, which as a competitor is a fairly dangerous exercise. I’ve tried to be even handed and fact based when doing this and have disclosed where possible the sources of my information. However, if you believe I’ve misrepresented the technology please let me know; this is not about vendor bashing, it’s about establishing what I hope is a fair basis of comparison.

Doing this in an even handed fashion was particularly hard because a lot of the publicly available information is either incomplete or somewhat contradictory. I know that is an industry wide problem, but this is one area where there seems to be a lot more marketing material than engineering substance. The main sources of my information were blog posts by EMC employees and integrators as well as an official EMC technical report, the details of which, and my takeaways from them, are as follows.

How Fast is FAST for VDI ?

Chad Sakac is quoted here in relation to the speed of writing data in various RAID configurations: “(I’ve added SSD with 6000 IOps as commented by Chad Sakac).” While I respect Chad’s comments to do with EMC’s integration with VMware, I think he might be a little off here, especially given that this comment was made 6 months ago, long before FAST Cache was announced.

Mark Twomey (StorageZilla) says that EFDs have no additional benefit for writes (I assume this applies mostly to Symmetrix, which already does good write optimisation). He is quoted here as saying “The thing most people don’t understand about Flash is that writes aren’t really all that much faster to a good SSD than they are to a regular disk drive. And thus, predicting where writes are going isn’t an objective of FAST”.

Then there is Randy Loeschner, who also works for EMC and seems to know his way around a database, and who says on his blog “Solid State/Enterprise Flash Drives are similar in Write Performance to 15K Fibre Channel disks, but in READ scenarios are capable of 2500 or more READ IOPs.”

I also checked the available benchmarks and found an EMC document that contained reasonably useful data, though even that seemed to contradict itself with regard to the number of IOPS you could get out of an EFD. It says that an Enterprise Flash Drive (EFD) can get 2,500 IOPS per drive, though without any details as to the latency or the I/O mix. Then further down it says that in a 50:50 read:write 8K I/O environment you can get 1057 IOPS per EFD at 12ms response time for reads and 24ms response time for writes without any additional help from the CLARiiON DRAM based write cache, or 1760 IOPS per EFD at 6ms response time for reads and 2ms response time for writes when the write cache is enabled.

I also found another informative post here at gotitsolutions.org, which shows roughly 1100, 1500, and 2000 IOPS per drive for 100% random writes, 60:40 write:read and 40:60 write:read workloads respectively, without help from DRAM caching. Furthermore, according to a colleague whose opinion I respect, it appears that “FAST Cache does 64K blocks …[which means that EMC] claim 50% more speed overall.”

Based on the above information, I think it would be reasonable to assume that a 6+1 EFD RAID group configured as FAST Cache would allow for between 12,000 and 20,000 sub 5ms IOPS depending on the configuration and workload. That’s pretty good, but it’s not the “orders of magnitude” faster than spinning disk so often claimed, and nowhere near the performance of array cache.
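To make that estimate a little more concrete, the following sketch simply multiplies the per-drive figures quoted above by the number of drives in the group. This is naive linear scaling and not necessarily how the 12,000 to 20,000 range was arrived at; treating all 7 drives of the 6+1 group as serving I/O is an assumption, as are the function and variable names.

```python
# Per-drive EFD figures quoted in the text above.
QUOTED_IOPS_PER_EFD = {
    "50:50 8K, write cache disabled": 1057,
    "50:50 8K, write cache enabled": 1760,
    "headline figure, I/O mix not stated": 2500,
}

DRIVES_SERVING_IO = 7  # assumption: every drive in the 6+1 group serves I/O


def aggregate_iops(per_drive_iops: int, drives: int = DRIVES_SERVING_IO) -> int:
    """Naive linear scaling: group IOPS = per-drive IOPS x number of drives."""
    return per_drive_iops * drives


if __name__ == "__main__":
    for label, per_drive in QUOTED_IOPS_PER_EFD.items():
        print(f"{label}: {aggregate_iops(per_drive):,} IOPS")
```

With these inputs the write-cache-enabled and headline figures scale to 12,320 and 17,500 IOPS, which sit inside the range estimated above, while the uncached figure falls well short of it.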

The benefits of a write mega-cache

A write cache in our hypothetical 1000 user, 12 IOPS per user, 33:63 R:W VDI environment equates to about 30MiB/sec of random write activity, or about 108 GiB per hour. A 6+1 RAID group of 146GB EFD drives provides about 822 GiB of usable cache space. If you split this 50:50 between read and write, this works out to about 4 hours of writes before you even begin to need to destage. The thing that differentiates a mega-cache from a standard cache is that it can absorb a sufficiently large number of changes to satisfy hours or possibly even entire business days’ worth of I/O. In addition, a cache this large is almost certainly going to improve the efficiency with which writes can go to the back end RAID group. The extent to which it does this is dependent on many different factors. In some edge cases the additional improvement is marginal; in others it could be close to the kinds of efficiencies typically seen in a NetApp FAS array. In theory a 6+1 RAID-5 disk set combined with a large write cache could approach or even exceed the write efficiency of a 6+6 RAID-10 disk set.
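The “about 4 hours” figure can be checked with a few lines of arithmetic. The sketch below starts from the 30MiB/sec, 822 GiB and 50:50 numbers given above; the function names are made up for illustration, and it uses strict binary units (1 GiB = 1024 MiB), so the hourly volume comes out slightly below the rounded 108 GiB quoted in the text.

```python
MIB_PER_GIB = 1024
SECONDS_PER_HOUR = 3600


def gib_written_per_hour(write_mib_per_sec: float) -> float:
    """Convert a sustained write rate in MiB/sec into GiB written per hour."""
    return write_mib_per_sec * SECONDS_PER_HOUR / MIB_PER_GIB


def hours_until_destage(usable_cache_gib: float, write_share: float,
                        write_mib_per_sec: float) -> float:
    """Hours of writes the write portion of the cache can absorb before it is full."""
    write_cache_gib = usable_cache_gib * write_share
    return write_cache_gib / gib_written_per_hour(write_mib_per_sec)


if __name__ == "__main__":
    hourly = gib_written_per_hour(30)
    hours = hours_until_destage(usable_cache_gib=822, write_share=0.5,
                                write_mib_per_sec=30)
    print(f"{hourly:.1f} GiB written per hour, cache absorbs {hours:.1f} hours")
```

This ignores any destaging that happens while the cache is filling, so it is a worst case for how quickly the cache fills rather than a prediction of real behaviour.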

The benefits of a read mega-cache

On the read side of the equation, mega-caches in the order of 250GiB+ have the advantage that they are able to store the majority of the active working set, especially in VDI environments where it is not unusual to see it offloading 80+% of the read I/O from the disks. This not only improves the latency of the I/Os  from cache but also those that need to come from disk. The disk improvements come from reduced I/O contention, and the ability to make read-ahead more effective as detection of the read pattern which triggers the read-ahead functions happens while the data is being served from cache. It also allows the read-ahead algorithms to be more aggressive as the potential risks of reading in too much data and flushing out other useful data is mitigated by the much larger read caches.

The Net-Net is that mega-caches can significantly reduce the average latency for disk I/O even in spindle constrained environments, and the ability to handle peak loads is significantly improved.

FlashCache

NetApp really stoked the market for mega-caches when it released the PAM-II, now called FlashCache (I’m kind of sad they changed the name, there were lots of bad PAM puns like “Flash in the PAM” that few if any will now remember). Unlike SSD/EFD based cache architectures, FlashCache connects to the storage controller via PCIe, includes a NetApp created flash translation layer and some dedicated hardware acceleration, and uses a driver which is tuned to the characteristics of all of this hardware. All of this results in a cache which is capable of hundreds of thousands of sub 2ms IOPS with shorter code paths and higher levels of CPU efficiency than is seen in SSD/EFD based caches.

Another thing that helps is cache awareness of FAS Deduplication and FlexClones, which in effect multiplies the effective size of the cache by the level of deduplication within the active dataset. For example, if you are using deduplication for persistent desktop guest O/S images and seeing 95% deduplication ratios (especially for the 2GB core operating system portions of the image), your effective cache size is 20x larger. This means that even a modest FAS2040 with 4GB of RAM can have an effective read cache of 50+GiB, which comes in really handy during boot storms. For a 256GB FlashCache, using the same math, the effective cache size ends up being around 5TB! That’s a best case situation, but the strange thing about VDI on NetApp is that best case scenarios just keep coming up over and over again, which is what prompted me to start on this series of posts in the first place.
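The 20x multiplier follows directly from the deduplication ratio: if 95% of the active blocks are duplicates, each cached block stands in for 20 logical blocks. A minimal sketch of that arithmetic, assuming (as the best case above does) that the whole active dataset deduplicates at the same ratio; the function names are hypothetical.

```python
def dedup_multiplier(dedup_ratio: float) -> float:
    """Logical blocks represented by each physical cached block.

    dedup_ratio is the fraction of blocks removed by deduplication (0.95 = 95%).
    """
    if not 0 <= dedup_ratio < 1:
        raise ValueError("dedup_ratio must be in the range [0, 1)")
    return 1 / (1 - dedup_ratio)


def effective_cache_gb(physical_cache_gb: float, dedup_ratio: float) -> float:
    """Effective cache size when cached blocks are shared across deduplicated data."""
    return physical_cache_gb * dedup_multiplier(dedup_ratio)


if __name__ == "__main__":
    print(f"multiplier at 95% dedup: {dedup_multiplier(0.95):.0f}x")
    print(f"256GB cache behaves like: {effective_cache_gb(256, 0.95) / 1024:.1f}TB")
```

Lower deduplication ratios shrink the benefit quickly: at 50% the multiplier is only 2x, which is why the figure is described above as a best case.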

Isnt FlashCache just for reads ?

As good as FlashCache is, some commentators have quite correctly pointed out that this cache is read only, but they then go on to draw the incorrect conclusion that it does nothing for write performance. This might elicit a “Thank you Captain Obvious” from some, but yet again this is one of those things which, like the sun revolving around the earth, is simple, understandable, full of common sense, and also happens to be wrong.

FlashCache + Dedup/FlexClone + Realloc = High speed write cache.

If you’ve read through this entire series, you might remember the following statement

“Thus we expect to write 336 blocks in 58+16= 74 disk operations. This gives us a write IEF of 454%”

This was on the assumption that the system was about 80% full and that the best allocation area was about 40% utilised. But what happens when the best allocation area is completely unallocated? This question was already covered in the following blog post, though it would appear the author decided to take another job outside of NetApp (good luck at CommVault Mike :-) and his NetApp blog may get cleaned up at some later time, so I’ve taken the liberty of taking an excerpt from it.

“For demonstration, I configured a single 3 disk NetApp aggregate (2 parity, 1 data) to demonstrate how much random write I/O I could get out of a single 1TB 7200 SATA drive .. The result is over 4600 random write IOPs with an average response time of 0.4ms. “

This was 1 SATA data drive (the other two were parity drives, which don’t add to write speeds) delivering 4600 random write IOPS. If you extrapolate this, 4 1TB SATA drives will give you about 3TB of usable storage and 18,400 IOPS. Woo Hoo! 4 SATA drives from NetApp = 7 EFD drives from EMC, game over, discussion closed .. right ?

Well, yes, under ideal circumstances, but the world is not a perfect place, and neither are the datacenters which inhabit it, even with VDI, so what might stop this from working outside of the unicorn farm ?

1. Those disks won’t stay empty

True, but to be equal to the amount of write cache used in our 7 disk EFD cache (assuming a 50:50 split between read and write cache) we could fill those 4 SATA drives with 2.5TB of data and still have an equivalent write caching capability.

2. That freespace won’t stay contiguous

True, but NetApp provides methods to re-arrange the freespace via the reallocate -A command (sometimes called segment cleaning). This option is particularly well suited to VDI environments where large burst writes are fairly typical and where optimising access for sequential reads is not generally considered a high priority.

3. There will be competition for read I/O

True, but single instancing technology and smart caching allows the majority of those reads to be served from cache.

But what about the real world ?

I plan to cover each one of these in some blogs on detailed performance tuning for NetApp, but rather than delve even deeper into abstract theory, I’m going to pull some data and graphs from an existing 2000+ seat VDI deployment that uses FlashCache and Reallocate to manage some very bursty I/O patterns. The interesting thing about this particular implementation is that it is far from an “ideal” workload and shows what can be done with a little bit of planning and some really smart storage controllers. In addition, with a little luck and some persistence, I’ll also pull up a far more modest lab environment and see exactly how much you can wring out of a NetApp controller on a tight budget.

Data Storage for VDI – Part 9 – Capex and SAN vs DAS

I’d intended writing about Megacaches in both the previous post and this one, but interesting things keep popping up that need to be dealt with first. This time it’s an article at InformationWeek, “With VDI, Local Disk Is A Thing Of The Past”. In it Elias Khnaser outlines the same argument that I was going to make after I’d dealt with the technical details of how NetApp optimises VDI deployments.

I still plan to expand on this with posts on Megacaches, single instancing technologies, and broker integration, but Elias’ post was so well done that I thought it deserved some immediate attention.

If you haven’t done so already, check out the article before you read more here, because the only point I want to make in this uncharacteristically small post is the following:

The capital expenditure for storage in a VDI deployment based on NetApp is lower than one based on direct attached storage.

This is based on the following

Solution 1: VDI Using Local Storage – Cost

$614,400

Solution 2 : VDI Using HDS Midrange SAN – Cost

$860,800, with array costs of approx $400,000

Solution 3 : VDI Using FAS 2040 – Cost

$860,800 – $400,000 + (2000 * $50) = $560,800

You save about $54,000 (roughly 9% overall) compared to DAS and still get the benefits of shared storage. That’s $54,000 you can spend on more advanced broker software or possibly a trip to the Bahamas.
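The arithmetic behind the three solutions can be re-run with a few lines of Python. This is only a sketch using the figures quoted above; the constant and function names are made up for illustration, and the $50 per user is the approximate street price mentioned in this post, not a list price.

```python
# Figures quoted in this post; names are illustrative only.
USERS = 2000
DAS_COST = 614_400          # Solution 1: VDI using local storage
SAN_COST = 860_800          # Solution 2: VDI using HDS midrange SAN (total)
SAN_ARRAY_COST = 400_000    # approximate array portion of Solution 2
NETAPP_PER_USER = 50        # approximate street price per desktop user


def netapp_solution_cost():
    """Swap the SAN array cost in Solution 2 for a per-user NetApp storage cost."""
    return SAN_COST - SAN_ARRAY_COST + USERS * NETAPP_PER_USER


netapp_cost = netapp_solution_cost()
saving = DAS_COST - netapp_cost

print(f"NetApp based solution: ${netapp_cost:,}")
print(f"Saving compared to DAS: ${saving:,} ({saving / DAS_COST:.1%})")
```

Changing `USERS` or `NETAPP_PER_USER` shows how sensitive the comparison is to the per-desktop storage price.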

Now if you’re wondering where I got my figures from, I did the same sizing exercise I did in Part 7 of this series but using 12 IOPS per user and a 33:63 R:W ratio. I then came up with a configuration and asked one of my colleagues for a street price. The figure came out to around $US50/desktop user for an NFS deployment, which is in line with what NetApp has been saying about our costs for VDI deployments for some time now.

Even factoring in things like professional services, additional network infrastructure, training etc., you’d still be better off from an up-front expenditure point of view using NetApp than you would with internal disks.

Given the additional OpEx benefits, I wonder why anyone would even consider using DAS, or even for that matter another vendor’s SAN.

Data Storage for VDI – Part 8 – Misalignment

If you follow NetApp’s best practice documentation, all of the stuff I talked about works as well as, if not better than, outlined at the end of my previous post. Having said that, it’s worth repeating that there are some workloads that are very difficult to optimize, and some configurations that don’t allow the optimization algorithms to work, the most prevalent of which is misaligned I/O.

If you follow best practice guidelines (and we all do that now don’t we …) then you’ll be intimately familiar with NetApp’s Best Practices for File System Alignment in Virtual Environments. If on the other hand you’re like pretty much everyone that went to the VMware course I attended, then you may be of the opinion that it doesn’t make that much of a difference. I suspect that if I asked your opinion about whether you should go to the effort to ensure that your guest O/S partitions are aligned, your response would probably fall into one of the following categories:

  1. Unnecessary
  2. Not Recommended by VMware (They do, but I’ve heard people say this in the past)
  3. Something I should do when I can arrange some downtime during the Christmas holidays
  4. What you talking about Willis ?

If there is one thing I’d like you to take away from this post, it is the incredible importance of aligning your guest operating systems. After the impact of old school backups and virus scans, it’s probably the leading cause of poor performance at the storage layer. This is particularly true if you have minimized the number of spindles in your environment by using single instancing technologies such as FAS deduplication.

Of course this being my blog, I will now go into painful detail to show why it’s so important. If you’re not interested or have already ensured that everything is perfectly aligned, stop reading and wait until I post my next blog entry :-)

Every disk reads and writes its data in fixed block sizes, usually either 512 or 520 bytes, the latter effectively storing 512 bytes of user data and 8 bytes of checksum data. Furthermore, the storage arrays I’ve worked with that get a decent number of IOPS/spindle all use some multiple of these 512 bytes of user data as the smallest chunk that is stored in cache, usually 4KiB or some multiple thereof. The arrays then perform reads and writes of data to and from disks using these chunks along with the appropriate checksum information. This works well because most applications and filesystems on LUNs / VMDKs / VHDs etc. also write in 4K chunks. In a well configured environment, the only time you’ll have a read or, more importantly, a write request that is not some multiple of 4K is in NAS workloads, where overwrite requests can happen across a range of bytes rather than a range of blocks, but even then it’s a rare occurrence.

Misalignment of I/O, though, causes a write from a guest to partially write to two different blocks, which is explained with pretty diagrams in Best Practices for File System Alignment in Virtual Environments. However, that document doesn’t quite stress how much of a performance impact this can have when compared to nicely aligned workloads, so I’ll spend a bit of time on this here.

When you overwrite a block in its entirety, an array’s job is trivially easy:

  1. Accept the block from the client and put it in one of the write cache’s block buffers
  2. Seek to the block you’re going to write to
  3. Write the block

Net result = 1 seek + 1 logical write operation (plus any RAID overheads)

However when you send an unaligned block, things get much harder for the array

  1. Accept a block’s worth of data from the client, put some of it in one of the block buffers in the array’s write cache, and put the rest of it into the adjacent block buffer. Neither of these block buffers will be completely full however, which is bad.
  2. If you didn’t already have the blocks that are being overwritten in the read cache, then
    1. Seek to where the two blocks start
    2. Read the 2 blocks from the disk to get the parts you don’t know about
    3. Merge the information you just read from disk / read cache with the block’s worth of data you received from the client
    4. Overwrite the two blocks with the data you just merged together

Net result = 1 seek + some additional CPU + double write cache consumption + 2 additional 4K reads, and one additional 4K write (plus any RAID overheads) + inefficient space consumption.
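To make the difference concrete, here is a minimal Python sketch that works out how many 4K array blocks a single guest write fully overwrites and how many it only partially touches. It is a simplified model, not how any array actually implements its write path: the function names are invented for illustration, it assumes nothing is in read cache, and the 63 sector starting offset is just the classic example of a misaligned guest partition.

```python
BLOCK_SIZE = 4096  # array block size in bytes (4KiB, as discussed above)


def blocks_touched(offset, length, block_size=BLOCK_SIZE):
    """Return (full_blocks, partial_blocks) for a write of `length` bytes at byte `offset`."""
    if length <= 0:
        return (0, 0)
    end = offset + length
    first = offset // block_size
    last = (end - 1) // block_size
    full = partial = 0
    for block in range(first, last + 1):
        block_start = block * block_size
        block_end = block_start + block_size
        if offset <= block_start and end >= block_end:
            full += 1      # whole block is overwritten, no read needed
        else:
            partial += 1   # only part of the block changes, old data must be merged
    return (full, partial)


def backend_ops(offset, length):
    """Rough back-end cost: one read per partial block (no cache hits assumed), one write per block."""
    full, partial = blocks_touched(offset, length)
    return {"reads": partial, "writes": full + partial}


print(backend_ops(0, 4096))         # aligned 4K write
print(backend_ops(63 * 512, 4096))  # same write, shifted by a 63 sector partition offset
```

The aligned write needs one block write and no reads, while the shifted write straddles two blocks and needs both of them read, merged and rewritten, which is the 2 additional reads and 1 additional write described above.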

The problem, as you’ll see, isn’t so much a misaligned write as such, but the partial block writes that it generates. In well configured “Block” environments (FC / iSCSI), you simply won’t ever see a partial write, however in “File” environments (CIFS/NFS), partial writes are a relatively small, but expected part of many workloads. Because FAS arrays are truly unified for both block and file, Data ONTAP has some sophisticated methods of detecting partial writes, holding them in cache, combining them where possible, and committing them to disk as efficiently as possible. Even so, partial writes are really hard to optimize well.

There are many clever ways of optimizing caching algorithms to mitigate the impact of partial writes, and NetApp combines a number of these in ways that I’m not at liberty to disclose outside of NetApp. We developed these options because a certain amount of bad partial write behavior is expected from workloads targeted at a FAS controller, and much like it is with our kids at home, tolerating a certain amount of “less than wonderful” behavior without making a fuss allows the household to run harmoniously. But this tolerance has its limit and after a point it needs to be pulled into line. While Data ONTAP can’t tell a badly behaved application to sit quietly in the corner and consider how its behavior is affecting others, it can mitigate the impact of partial writes on well behaved applications.

Unfortunately, environments that do wholesale P2V migrations of WinXP desktops without going through an alignment exercise will almost certainly generate a large number of misaligned writes. While Data ONTAP does what it can to maintain the highest performance it can under those circumstances, these misaligned writes are much harder to optimise, which in turn will probably have a non-trivial impact on the overall performance by multiplying the number of I/Os required to meet the workload requirements.

If you do have lots of unaligned I/O in your environment, you’re faced with one of four options.

  1. Use the tools provided by NetApp and others like VisionCore to help you bring things back into alignment
  2. Put in larger caches. Larger caches, especially megacaches such as FlashCache, mean the data needed to complete the partial write will already be in memory, or at least on a medium that allows sub millisecond read times for the data required to complete partial writes.
  3. Put in more disks, if you distribute the load amongst more spindles, then the read latency imposed by partial writes will be reduced
  4. Live with the reduced performance and unhappy users until your next major VDI refresh

Of course the best option is to avoid misaligned I/O in the first place by following Best Practices for File System Alignment in Virtual Environments. This really is one friendly manual that is worth following regardless of whether you use NetApp storage or something else.

To summarise – misaligned I/O and partial writes are evil and they must be stopped.

Data Storage for VDI – Part 7 – 1000 heavy users on 18 spindles

The nice thing from my point of view is that because VDI’s steady state performance is characterized by a high percentage of random writes and high concurrency, the performance architecture of Data ONTAP has been well optimized for VDI for quite some time, in fact since before VDI was really a focus for anyone. As my dad said to me once, “Sometimes it’s better to be lucky than it is to be good” :-)

As proof of this, I used our internal VDI sizing tools for

  • 1000 users
  • 50% Read, 50% Writes
  • 10 IOPS per user
  • 10GB single instanced (using FAS Deduplication) Operating system image
  • 0.5 GB RAM per Guest (used to factor the vSwap requirements)
  • 1 GB of Unique data per user (deliberately low to keep the focus on the number of disks required for IOPS)
  • 20ms read response times
  • WAFL filesystem 90% Full

The sizer came back with needing only 24 FC disks to satisfy the IOPS requirement on our entry level 2040 controller without needing any form of SSD or extra accelerators.

That works out to over 400 IOPS per 15K disk, or about 40 users per 15K disk, four times the 10 users per 15K RAID-DP spindle predicted by Ruben’s model. For the 20% read, 80% write example, the numbers are even better, with only 18 disks on the FAS-2040, which is 555 IOPS or 55 users per disk vs. the 9 predicted by Ruben’s model (about six times what was predicted). To see how this compares to other SAN arrays, check out the following table which outlines the expected efficiencies from RAID 5, 10, and 6 for VDI workloads.

RAID type | Read IEF | Write IEF | Overall Efficiency at 30:70 R/W | Overall Efficiency at 50:50 R/W
RAID-5 | 100% | 25% | 47.5% | 62.5%
RAID-10 | 100% | 50% | 65% | 75%
RAID-6 | 100% | 17% | 41.9% | 58.5%
RAID-DP + WAFL 90% Full | 100% | 200-350% | 230% | 170%

The really interesting thing about these results is that as the workload becomes more dominated by write traffic, RAID-DP+WAFL gets even greater efficiencies. At a 50:50 workload the write IEF is around 240%, however at a 30:70 workload the write IEF is close to 290%. This happens because random reads inevitably cause more disk seeks, whereas writes are pretty much always sequential.
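Those write IEF figures can be backed out of the table using the overall IEF formula from Part 4 of this series (Overall IEF = Read% * read IEF + Write% * write IEF). The snippet below is just that formula rearranged; the function name is invented for illustration and it assumes a read IEF of 100%, as shown in the RAID-DP + WAFL row.

```python
def implied_write_ief(read_fraction, read_ief, overall_ief):
    """Solve Overall IEF = Read% * read IEF + Write% * write IEF for the write IEF."""
    write_fraction = 1.0 - read_fraction
    return (overall_ief - read_fraction * read_ief) / write_fraction


# RAID-DP + WAFL row of the table, read IEF assumed to be 100%
print(f"50:50 -> write IEF {implied_write_ief(0.5, 1.0, 1.70):.0%}")  # 240%
print(f"30:70 -> write IEF {implied_write_ief(0.3, 1.0, 2.30):.0%}")  # 286%
```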

Don’t get me wrong, I think Ruben did outstanding work, and something which I’ve learned a lot from, but when it comes to sizing NetApp storage by I/O I think he was working with some inaccurate or outdated data that led him to some erroneous conclusions which I hope I’ve been able to clarify in this blog.

In my next post, I hope to cover how megacaching techniques such as NetApp’s FlashCache can be used in VDI environments and a few specific configuration tweaks that can be used on a NetApp array to improve the performance of your VDI environment.

Data Storage for VDI – Part 6 – Data ONTAP Improving Read Performance

WAFL, Metadata Reads and SRARW

This brings us to reads. WAFL allows us to excel at writes, but what about reads? I’ve already stated that compared to other RAID configurations RAID-DP is about 13% worse for reads, so what does WAFL do to offset that? Well to start with, it can actually make things worse (and yes, I still work for NetApp). Why and how does this happen? Well, remember that WAFL is a fine-grained storage virtualisation layer: we map, and can remap, the physical whereabouts of every single 4K block. In order to find the block you want to read, the array needs to consult this map. Old school traditional SAN array controllers don’t need to do this, they use an algorithm like base+offset to find the requested block, or they map larger chunks (e.g. 250KB) and they pin a much smaller map inside of the array’s cache. Because the WAFL map (the metadata) is relatively large, historically, only a portion of it stays in memory cache. When the active working set is very large, WAFL will probably need to do two back-end disk reads for a majority of the front-end reads, one for metadata and one for the data.

The combination of losing read spindles to dedicated parity drives, and then losing more IO bandwidth to metadata reads can put Data ONTAP at a disadvantage for workloads with a high percentage of random reads. But wait, there’s more! There’s one more issue which is occasionally thrown at us by our competitors. Sometimes known as “Sequential Reads After Random Writes” or SRARW, it can be a problem for WAFL (and, I’d imagine, other similar data layout engines such as ZFS that use mapping rather than algorithms to locate data). The reason for this is that turning random writes into sequential writes can mean that sequential reads get turned into random reads, and that has a fairly negative impact on sequential read performance.

Now before I go into this in detail, keep in mind that for the vast majority of VDI deployments this is not a problem. The only time people really tend to notice this is during old school bulk data copy style backups and database integrity checks. Having said that there are a number of things NetApp does to mitigate the SRARW effect.

WAFL and Temporal Locality of Reference

Firstly, another way of looking at things is that what WAFL does is to exchange “spatial locality of reference” for “temporal locality of reference”. For example, when you write a file into a filesystem like NTFS, or update a database record, you will typically update the MFT or indexes at the same time. Regardless of the apparent logical layout where the MFT or indexes are stored on different regions within the same LUN, WAFL will place all of these updates close to each other on the disk/disks. Similarly, in a VDI deployment a write of a single file to a fragmented Windows filesystem might logically be written to multiple locations on its disk, but they will all be stored close together on the disk on the NetApp array. In VDI and OLTP environments, this is a good thing, because in order to access a file or record, you first access the MFT or index which then points to the data you’re after. Guess what! Because all parts of the file and its metadata are laid out close to each other, there is a very good chance that you won’t need to do a seek+settle to get the heads to the data portions, resulting in much improved disk reads. In effect, this allows a FAS array to do inline physical defragmentation of guest filesystems.

Readsets

Data ONTAP is able to combine this temporal locality of reference with a little publicized feature called a read-set. A read-set is a record of which sets of data are regularly read together and is stored along with the rest of the metadata in WAFL. This provides a level of semantic knowledge about the underlying data that the readahead algorithm uses to ensure that in most cases, the data has already been requested and read into read cache before the VDI client sends down its next read request.

Reallocation

Secondly (and this really applies more to database environments than it does to VDI, but I’ll include it here for the sake of completeness) there are techniques which completely address the SRARW issue.

1. WAR (woah woah woah, what is it good for) ..

As it turns out, this kind of WAR is good for quite a few workload types because it stands for “write after read”. This feature has been available since Data ONTAP 7.3.1, and when enabled for a volume, it senses when you’ve requested a bunch of data that is logically sequential, figures out if it had to do an excessive number of random reads at the back end, and if so, finds a nice clean area to write this stuff out, so that the next time you do the same logically sequential read, it is nicely sequentially laid out in a physical sense. I’ve done some tests in hostile environments (a month of running 10+ hours every day of completely random reads followed by a complete sequential integrity check of an Exchange 2007 database), and the WAR option increased the sequential scan time by about 15%. A subsequent scan of the same database took 40% less time (and it probably would have been faster if I hadn’t hit a client CPU bottleneck during the integrity check).

2. Regular reallocation scans.

These are recommended as a default best practice for database LUNs in the Data ONTAP administration guide, though it seems that nobody actually reads this friendly manual, so it still doesn’t seem to be common practice. These scans execute every night, and run a complete reallocation of only the “fragmented” blocks. Based on some experiments I did on a 3040 with 12 spindles, this works at about 100+ GB per hour, so for a 4TB database with a 2% daily change rate, a nightly reallocate would take about an hour. This might seem like an imposition, but if you’ve cut your nightly backup window by 8 hours due to cool snapshots and SnapVault/SnapMirror, then adding back an hour to optimise the performance isn’t a big ask. This also creates some nice clean free-space areas, which keeps the write performance nice and snappy. As a side effect, regular reallocations mean that any disks added to the aggregate will quickly get hot data evenly spread across them thereby improving read performance even more.
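The “about an hour” estimate is easy to reproduce. The sketch below is back-of-the-envelope arithmetic only: the function name is invented for illustration, it treats 4TB as 4000GB, and it assumes the blocks needing reallocation are roughly the blocks changed that day, which is a simplification.

```python
def nightly_reallocate_hours(db_size_gb, daily_change_rate, rate_gb_per_hour):
    """Estimate the nightly reallocation time, assuming only changed data needs moving."""
    changed_gb = db_size_gb * daily_change_rate
    return changed_gb / rate_gb_per_hour


# 4TB database, 2% daily change, roughly 100GB/hour as measured on the 3040 above
hours = nightly_reallocate_hours(4000, 0.02, 100)
print(f"Estimated nightly reallocate: {hours:.1f} hours")
```

With a higher change rate or a slower controller the window grows linearly, so it is worth plugging in your own numbers before scheduling the scans.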

It should be noted that these two techniques don’t work with deduplicated volumes. If you believe you will be running a lot of single threaded sequential reads in your VDI environment, you should consider placing those workloads on a volume which does not have deduplication turned on, and possibly use one of the other single instancing technologies such as VMware View in combination with one of the techniques described above.

As I said before, I’ve included those two points for the sake of completeness, but for VDI environments where the I/O profile is almost completely random, WAFL’s default behavior of a data layout based on temporal locality of reference will give you better performance than a layout based on spatial locality of reference as used by traditional arrays.

That’s soooo random

At this stage it might be worthwhile noting that random reads and writes aren’t truly random, they are merely “non sequential”. There are few truly random things outside of the world of mathematics, storage benchmarks, and quantum physics. It is this that allows the fuzzy logic in Data ONTAP’s read-ahead algorithms to do their remarkable work. NetApp spent a lot of time and brainpower on creating and fine-tuning these, and I’m confident that they are unsurpassed by any other storage array. This is where I’d like to extensively quote another section out of Ruben’s excellent article with some additions of my own.

The NTFS filesystem on a Windows client uses 4 kB blocks by default. Luckily, Windows tries to optimize disk requests to some extent by grouping block requests together if, from a file perspective, they are contiguous [which the readset feature in Data ONTAP is built to recognise]. That means it is important that files are defragged [Except in Data ONTAP where WAFL has already stored these logically fragmented files physically close to each other thanks to the magic of temporal locality] ….. Therefore it is best practice to disable defragging completely once the master image is complete [Which might be a concern without the performance optimisations built into Data ONTAP]. The same goes for prefetching. Prefetching is a process that puts all files read more frequently in a special cache directory in Windows, so that the reading of these files becomes one contiguous reading stream, minimizing IO and maximizing throughput. But because IOs from a large number of clients makes it totally random from a storage point of view, prefetching files no longer matters and the prefetching process only adds to the IOs once again. So prefetching should also be completely disabled. [however Data ONTAP effectively and transparently restores this performance enhancement thanks to the way readsets work with Data ONTAP’s prefetch/readahead capabilities] If the storage is de-duplicating the disks, moving files around inside those disks will greatly disturb the effectiveness of de-duplication. That is yet another reason to disable features like prefetching and defragging. [not to mention that, for the most part, with Data ONTAP it’s completely unnecessary]

Aggregating Disk IOPs

Another thing that helps NetApp is the concept of aggregates, which makes it a lot easier to recruit the collective IOPs of all the spindles in an array rather than having IOPs trapped and wasted within small RAID groups. In principle, it’s similar to the closely related concept of wide striping. It also globalises the pool of free blocks, which makes the write allocator’s job much easier. This combination of readsets, hyper-efficient writes and the ability to recruit a lot of spindles to the read workloads means that for most real world workloads, NetApp is as fast as, if not faster than, equivalently configured arrays from other vendors, which was nicely shown in independently audited industry standard SPC-1 benchmarks.

What you might have heard …

For me though, one of the main proofs of the effectiveness of these techniques is that pretty much every “benchmark” run on our kit by our competitors tries to ensure that none of these features are used. I’ve seen things like using artificial 100% completely random workloads to ensure that readsets can’t be used, unrealistically large working sets to ensure the maximum number of metadata reads, and really small aggregates, misaligned I/O and other non best practice configurations to make the write allocator’s job as hard as possible. It’s said that all is fair in love and IT marketing, but the shenanigans that some vendors get up to in order to discredit Data ONTAP’s performance architecture often go beyond the bounds of professional conduct.

Moving right along

OK, now I have that off my chest, I can move on to the next part of my blog, Data Storage for VDI – Part 7 – 1000 heavy users on 18 spindles, where I’ll show how Data ONTAP can help reduce the storage costs for VDI to the point where you can afford to use world class shared storage without the availability and manageability compromises involved with DAS and other forms of cut price storage.

Data Storage for VDI – Part 5 – RAID-DP + WAFL The ultimate write accelerator

A lot has been written about WAFL, but for the most part, I still think it’s widely misunderstood, even by some folks within NetApp. The alignment between the kind of fine grained storage virtualisation you get out of WAFL and other forms of compute and network virtualisation is sometimes hard to appreciate until you’ve had a chance to really get into the guts of it.

Firstly WAFL means we can write any block to any location, and we use the capability to turn random writes at the front end into sequential writes at the back end. When we have a brand new system, we are able to do  full stripe writes to the underlying RAID groups, and the write coalescing works with perfect efficiency without needing to use much if any write cache. If we have a RAID group consisting of 14 data disks and 2 parity disks (the default setting), then a simple way of looking at our write efficiency starts out like this – 14 writes come in, 16 writes go to the back end,  14:16 or 87.5% efficiency, something that makes RAID-10 look a little sick in comparison.

Of course, the one thing that our competitors seem almost duty bound to point out is that as WAFL’s capacity fills, the ability to do full stripe writes diminishes, which is true, but only up to a point. The following graph shows what would happen to this write efficiency advantage as WAFL fills up, assuming that the data is uniformly and randomly distributed across the entire RAID set, and that we had no other way of optimizing performance.

The nice thing about this graph is that it is simple, it’s reasonably intuitive, and it shows our random write performance stays nicely above RAID-10 until we have used about 60% of the available capacity of a RAID-10 array with the same number of spindles. Now before the likes of @HPstorageguy have a field day, I’d like to point out that this graph/model, like many other simple and intuitive things, such as the idea that the world is flat and that the sun revolves around us, is wrong or at least misleading. The main reason it is misleading is because it underestimates Data ONTAP’s ability to exploit data usage patterns that happen in the real world.

This next section is pretty deep, you don’t need to understand it, but it does demonstrate how abstracting away a lot of the detail can lead you to bad conclusions. If you’re not that interested, or are time poor and you’re willing to take a leap of faith and believe me when I say that WAFL is able to maintain extremely high write performance even when the array is almost full, jump down to the text under the next graphic, otherwise feel free to read on.

Firstly, Data ONTAP does an excellent job of using allocation areas that are much emptier than the system is on average.  This means that if the system is 80% full then WAFL is typically writing to free space that is, perhaps, 40% full. The RAID system also combines logically separate operations into more efficient physical operations.

Suppose, for example, that in writing data to a 32-block long region on a single disk in the RAID group, we find that there are 4 blocks already allocated that cannot be overwritten. First, we read those in, and this will likely involve fewer than 4 reads, even if the data is not contiguous. We will issue some smaller number of reads (perhaps only 1) to pick up the blocks we need and the blocks in between, and then discard the blocks in between (called dummy reads). When we go to write the data back out, we’ll send all 28 (32-4) blocks down as a single write operation, along with a skip-mask that tells the disk which blocks to skip over. Thus we will send at most 5 operations (1 write + 4 reads) to this disk, and perhaps as few as 2. The parity reads will almost certainly combine, as almost any stripe that has an already allocated block will cause us to read parity.

So suppose we have to do a write to an area that is 25% allocated. We will write .75 * 14 * 32 blocks, or 336 blocks. The writes will be performed in 16 operations (1 for each data disk, 1 for each parity). On each parity we’ll issue 1 read. There are expected to be 8 blocks read from each disk, but with dummy reads we expect substantial combining, so let’s assume we issue 4 reads per disk (which is very conservative). There are 4 * 14 + 2 read operations, or 58 read operations. Thus we expect to write 336 blocks in 58+16= 74 disk operations. This gives us a write IEF of 454%, not the 67% as predicted by the graph above.

That is the good news. However, life is rarely this good, for example, not all random writes are 4K random writes. If customers start doing 8K random writes, then these 336 blocks are only 168 operations, for 227% efficiency. Furthermore, there is metadata. How much metadata is very sharply dependent on the workload. In worst case situations, WAFL can write about as much metadata as data. This is much higher than real-world, but if we go with that ratio, then 336 blocks becomes 84 operations. This gives us pretty much a worst case Write IEF of 113% when almost everything is going against us, which is better even than you’d get from most RAID-0 configurations, and twice as good as RAID-10.
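For anyone who wants to play with the numbers, the worked example above can be reproduced with a short Python sketch. It is a simplified model of the arithmetic in this section only, not of how Data ONTAP actually schedules I/O; the function and parameter names are invented for illustration, and the 4 reads per data disk is the same conservative assumption used in the text.

```python
DATA_DISKS = 14
PARITY_DISKS = 2
REGION_BLOCKS = 32  # blocks per disk in the region being written


def write_ief(allocated_fraction, reads_per_data_disk=4,
              blocks_per_write=1, metadata_ratio=0.0):
    """Front-end writes serviced per back-end disk operation for a partially allocated region."""
    free_blocks = (1 - allocated_fraction) * DATA_DISKS * REGION_BLOCKS
    write_ops = DATA_DISKS + PARITY_DISKS                         # one write per disk
    read_ops = reads_per_data_disk * DATA_DISKS + PARITY_DISKS    # plus one read per parity disk
    # Each front-end write consumes blocks_per_write data blocks plus its share of metadata.
    front_end_writes = free_blocks / blocks_per_write / (1 + metadata_ratio)
    return front_end_writes / (write_ops + read_ops)


print(f"4K writes:               {write_ief(0.25):.1%}")
print(f"8K writes:               {write_ief(0.25, blocks_per_write=2):.1%}")
print(f"8K writes, 1:1 metadata: {write_ief(0.25, blocks_per_write=2, metadata_ratio=1.0):.1%}")
```

Varying `allocated_fraction` and `reads_per_data_disk` gives a feel for how quickly (or slowly) the efficiency degrades as the allocation area fills up.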

Theory is all well and good, but to see how this works in practice, look at the following graph of a real world scenario. Here we have a bunch of small aggregates, each with 28 15K disks servicing over 4000 8K IOPS at a 53:47 read/write ratio (Exchange 2007), with aggregate space utilisation above 80%. The main thing to note on this graph is the latency: during this entire time the write latency (the purple line at the bottom) was flat at about 1ms. Read latency was about 6 ms, except for a slight (1 – 2 ms) increase in read latency across one of the LUNs during a RAID reconstruct (represented by the circled points 1 and 2 on the graph).

I see this across almost every NetApp array on which I’ve had the chance to do a performance analysis. Read latencies are around about the same as a traditional SAN array, but write latency is consistently very low, even on our smallest controllers. In general, a NetApp array’s ability to service random write requests is only limited by the rate at which sequential writes can be written to the back end disks, which gives us SSD levels of random write performance from good old spinning rust. Ruben may have been gracious by assuming that we were achieving the same kind of write performance from RAID-DP as you might get from a traditional RAID-10 layout, but theory, benchmarks, and real world experience say that RAID-DP + WAFL generally does a lot better than that. In most VDI deployments I’d expect to see much better than 150% Write IEF.

For write intensive workloads like VDI, this is excellent news, but writes are only half (or maybe 70%) of the story, which brings me to my next post Data Storage for VDI – Part 6 – Data ONTAP Improving Read Performance

Data Storage for VDI – Part 4 – The impact of RAID on performance

As I said at the end of my previous blog post

The read and write cache in traditional modular arrays are too small to make any significant difference to the read and write efficiencies of the underlying RAID configuration in VDI deployments

The good thing is that this makes calculating the Overall I/O Efficiency Factor (IEF) for traditional RAID configurations pretty straightforward. The overall IEF depends on the kind of RAID and the mixture of reads and writes, according to the following formula:

Overall IEF = (Read% * read IEF) + (Write% * write IEF).

To start with RAID-5: a single front-end write IOP requires 4 back-end IOPs, giving a write IEF of 25%. If you had 28 15K spindles in a RAID-5 configuration, this means you can only sustain 235 * 28 * 25% = 1645 write IOPS at 20ms.
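
As a quick sanity check on that arithmetic, here is a minimal Python sketch. The function name is mine, and I'm reading the 235 in the formula as the IOPS a single 15K spindle delivers at 20ms; treat both as illustrative assumptions rather than anything official.

```python
def sustainable_write_iops(spindles, iops_per_spindle, write_ief):
    """Front-end random write IOPS a RAID set can sustain.

    write_ief is the write I/O Efficiency Factor as a fraction (0.25 = 25%).
    """
    return spindles * iops_per_spindle * write_ief


# 28 spindles, 235 IOPS each (assumed per-spindle figure), RAID-5 write IEF of 25%
print(sustainable_write_iops(28, 235, 0.25))  # 1645.0
```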

Using Ruben’s 30:70 read:write steady-state VDI workload, the Overall IEF for RAID-5 would be

(30 * 100%) + (70 * 25%) = 47.5%.
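
The Overall IEF formula is easy to turn into a few lines of Python if you want to try other read:write mixes. This is only a sketch; the function and argument names are hypothetical, with the workload mix given in percent and the IEFs given as fractions.

```python
def overall_ief(read_pct, write_pct, read_ief, write_ief):
    """Overall IEF in percent.

    read_pct and write_pct are the workload mix in percent (they must sum to 100);
    read_ief and write_ief are fractions, e.g. 0.25 for a 25% write IEF.
    """
    if read_pct + write_pct != 100:
        raise ValueError("read_pct and write_pct must add up to 100")
    return read_pct * read_ief + write_pct * write_ief


# RAID-5 with a 30:70 read:write mix, as in the example above
print(overall_ief(30, 70, 1.00, 0.25))  # 47.5
```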

For a 50:50 workload, the Overall IEF would be

(50 * 100%) + (50 * 25%) = 62.5%

For RAID-10 you sacrifice half of your capacity, but instead of there being 4 IOPS for every 1 front-end write there are 2, for a write IEF of 50%. The write-coalescing caching tricks also add some benefit to RAID-10 but, again, not enough to have any significant effect.

So how about RAID-6? With RAID-6, every front-end write I/O requires 6 IOPS at the back end, for an uncached write IEF of about 17% and a cached write IEF of about 27%. Reads for non-NetApp RAID-6 based on Reed-Solomon algorithms are, yet again, unaffected.
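
The uncached write IEF figures so far are simply the reciprocal of the number of back-end IOPs each front-end write costs. Here is a small Python sketch of that relationship, using only the penalties quoted above; the dictionary and function names are made up for illustration, and the cached figures are deliberately left out because they don't fall out of this simple rule.

```python
# Back-end IOPs needed per front-end random write, as quoted in the text
BACKEND_IOPS_PER_WRITE = {"RAID-5": 4, "RAID-10": 2, "RAID-6": 6}


def uncached_write_ief(raid_level):
    """Uncached write IEF as a fraction: one front-end write / back-end IOPs."""
    return 1.0 / BACKEND_IOPS_PER_WRITE[raid_level]


for level in BACKEND_IOPS_PER_WRITE:
    print(f"{level}: {uncached_write_ief(level):.0%}")
# RAID-5: 25%
# RAID-10: 50%
# RAID-6: 17%
```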

So, what about RAID-DP? Well, much as I hate to say it, even though it is a form of RAID-6, by itself it has the worst performance of all the RAID schemes (and yes, I do still work for NetApp).

Why? Because RAID-DP, like RAID-4, uses dedicated parity disks. Given that, by default, one disk in every 8 is dedicated to parity and can’t be used for data reads, both RAID-4 and RAID-DP immediately take a 13% hit on reads. In addition, just like RAID-6, every front-end random write IOP can require up to 6 IOPS at the back end. This would mean that NetApp has the same write performance as RAID-6 and 13% worse read performance.

This gives the following result for overall IEF for the 30:70 read:write use case

(30 * 87%) + (70 * 17%) = 38.0%   (!!)
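
For anyone who wants to replay that pessimistic calculation, here is a small Python sketch. It uses the unrounded inputs (7 of every 8 disks serving reads, 6 back-end IOPs per write) rather than the rounded percentages, so it lands at roughly 38%. The function and parameter names are illustrative only, and this models the naive reasoning, not how RAID-DP actually behaves with WAFL.

```python
def naive_raid_dp_ief(read_pct, write_pct, parity_fraction=1 / 8,
                      backend_iops_per_write=6):
    """Overall IEF in percent under the naive 'RAID-DP without WAFL' model."""
    read_ief = 1.0 - parity_fraction           # parity disks serve no data reads
    write_ief = 1.0 / backend_iops_per_write   # same write penalty as RAID-6
    return read_pct * read_ief + write_pct * write_ief


# 30:70 read:write mix
print(round(naive_raid_dp_ief(30, 70), 1))  # 37.9
```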

This is exactly the kind of reasoning our competitors use when explaining our technology to others.

So why would NetApp be insane enough to make RAID-DP the default configuration? How have we succeeded so well in the marketplace? Shouldn’t there be a tidal wave of unhappy NetApp customers demanding their money back?

Well, there are a few reasons we use RAID-DP as the default configuration for all NetApp arrays. The first is that dedicated parity drives make RAID reconstructs fast with minimal performance impact. They also make it trivially easy to add disks to RAID groups non-disruptively. “This might be great for availability, but what about performance?” I hear you ask. Well, I’ve been told that you can mathematically prove that the RAID-DP algorithms are the most efficient possible way of doing dual-parity RAID. Frankly, the math is beyond me, but the CPU consumption by the RAID layer is really minimal. The real magic, however, happens because RAID-DP is always combined with WAFL.

This isn’t a good place to explain everything I know about WAFL, and others have already done it better than I probably can (cf. Kostadis’ Blog), but I’ll outline the salient benefits from a performance point of view in the next post: Data Storage for VDI – Part 5 – RAID-DP + WAFL The ultimate write accelerator
