Why Archive ?
Preamble
During a discussion I had at the SNIA blogfest, I mentioned that I’d written a whitepaper around archiving and I promised that I’d send it on. It took me a while to get around to this, but I finally dug it out from my archive, which is implemented in a similar way to the example policy at the bottom of this post. Once I’d started looking for it, it only took me about a minute to search for and retrieve it from a FAS array that was far enough away from me to incur a 60ms RTT latency. Overall I was really happy with the result.
The document was in my archive because I wrote it almost two years ago. Since then a number of things have changed, however the fundamental principles have not. I’ll work on updating this when things are less busy, probably sometime around January ’11. On a final note, because I wrote this a couple of years ago when my job role was different than it is today, this document is considerably more “salesy” than my usual blog posts. It shouldn’t be construed as a change in direction for this blog in general.
Introduction
There are a number of approaches that can be broadly classified as some form of Archiving, including Hierarchical Storage Management (HSM), and Information Lifecycle Management (ILM). All of these approaches aim to improve the IT environment by
- Lowering Overall Storage Costs
- Reducing the backup load and improving restore times
- Improving Application Performance
- Making it easier to find and classify data
The following kinds of claims are common in the marketing material promoted by vendors of Archiving software and hardware
“By using Single Instance Storage, data compression and an ATA-based archival storage system (as opposed to a high performance, Fibre Channel device), the customer was able to reduce storage costs by $160,000 per terabyte of messages during the first three years that the joint EMC / Symantec solution was deployed. These cost savings were just the beginning, as the customers were also able to maintain their current management headcount despite a 20% data growth and the time it took to restore messages was drastically reduced. By archiving the messages and files, the customer was also able to improve electronic discovery retrieval times as all content is searchable by keywords.”
These kinds of results, while impressive, assume a number of things that are not true in NetApp implementations
- A price difference between primary storage and archive storage of over $US160,000 per TB.
- Backup and restores are performed from tape using bulk data movement methods
- Modest increases in storage capacities require additional headcount
In many NetApp environments, the price difference between the most expensive tier of storage and the least expensive simply does not justify the expense and complexity of implementing an archiving system based on a cost per TB alone.
For file serving environments, many file shares can be stored effectively on what would be traditionally thought of as “Tier-3” storage with high-density SATA drives, RAID-6 and compression / deduplication. This is because unique NetApp technologies such as WAFL and RAID-DP provide the performance and reliability required for many file serving environments. In addition, the use of NetApp SnapVault replication based data protection for backup and long term retention means that full backups are no longer necessary. The presence or absence of the kinds of static data typically moved into archives has little or no impact on the time it takes to perform backups, or to make data available in the case of disaster.
Finally, the price per GB and IOPS for NetApp storage has fallen consistently in line with the trend in the industry as a whole. Customers can lower their storage costs by purchasing and implementing storage only as required. The ability of NetApp FAS arrays to non-disruptively add new storage, or move excess storage capacity and I/O from one volume to another within an aggregate, makes this approach both easy and practical.
While the benefits of archiving for NetApp based file serving environments may be marginal, archiving still has significant advantages for email environments, particularly Microsoft Exchange. The reasons for this are as follows
- Email is cache “unfriendly” and generally needs many dedicated disk spindles for adequate performance.
- Email messages are not modified after they have been sent/received
- There is a considerable amount of “noise” in email traffic (spam, jokes, social banter etc)
- Small Email Stores are easier to cache, which can significantly improve performance and reduce the hardware requirements for both the email servers and the underlying storage
- Email is more likely to be requested during legal discovery
- Enterprises now consider Email to be a mission critical application and some companies still mandate a tape backup of their email environments for compliance purposes.
Choosing the right Archive Storage
It’s about the application
EMC and NetApp take very different approaches to archive storage, each of which works well in a large number of environments. An excellent discussion on the details of this can be found in the NetApp whitepaper WP-7055-1008 Architectural Considerations for Archive and Compliance Solutions. For most people however, the entire process of archive is driven not at the storage layer, but by the archive applications. These applications do an excellent job of making the underlying functionality of the storage system transparent to the end user, however the user is still exposed to the performance and reliability of the storage underlying the archives.
Speed makes a difference
Centera was designed to be “Faster than Optical” and while it has surpassed this relatively low bar, its performance doesn’t come close to even the slowest NetApp array. This is important, because the amount of data that can be pushed onto the archive layer is determined not just by IT policy, but also by user acceptance and satisfaction with the overall solution. The greater the user acceptance, the more aggressive the archiving can be, which results in lower TCO and faster ROI.
Protecting the Archive
While the archive storage layer needs to be reliable, it should be noted that without the archive application and its associated indexes, the data is completely inaccessible, and may as well be lost. It might be possible to rebuild the indexes and application data from the information in the archive alone, but this process is often unacceptably long. Protecting the archive involves protecting the archive data store, the full text indexes, and the associated databases in a consistent manner at a single point in time.
Migrating from an Existing Solution
Many companies already have archiving solutions in place, but would like to change their underlying storage system to something faster and more reliable. Fortunately, archiving applications build the capability to migrate data from one kind of back-end storage to another into their software. The following diagrams show how this can be achieved for EmailXtender and DiskXtender to move data from Centera to NetApp.
Some organizations would prefer to completely replace their existing archiving solutions including hardware and software. For these customers NetApp collaborates with organizations such as Procedo (www.procedo.com), to make this process fast and painless.
SnapVault
As mentioned previously, the cost and complexity of traditional archiving infrastructure may not add sufficient value to a NetApp file-serving environment, as many of the problems it solves are already addressed by core NetApp features. This does not mean that some form of storage tiering could not or should not be implemented on FAS to reduce the amount of NetApp primary capacity.
One easy way of doing this is by taking advantage of the flexibility of the built in backup technology. This is an extension of the “archiving” policy used by many customers, where the backup system is used for archive as well. Although the approach of mixing backup and archive is rightly discouraged by most storage management professionals, the reasons for discouraging it in traditional tape based backup environments don’t apply here.
The reasons for this are
- Snapshot and replication based backups are not affected by capacity as only changed blocks are ever moved or stored
- The backups are immediately available, and can be used for multiple purposes
- Backups are stored on high reliability disk in a space efficient manner using both non-duplication and de-duplication techniques
- Files can be easily found via existing user interfaces such as Windows Explorer or external search engines
In general, SnapVault destinations use the highest density SATA drives with the most aggressive space savings policies applied to them. These policies and techniques, which may not be suitable for high performance file sharing environments, provide the lowest cost per TB of any NetApp offering. This combined with the ability to place the SnapVault destination in a remote datacenter may relieve the power, space and cooling requirements of increasingly crowded datacenters.
An example policy
Many companies’ file archiving requirements are straightforward, and do not justify the detailed capabilities provided by archiving applications. For example, a company might implement the following backup and archive policy
- All files are backed up on a daily basis with daily recovery points kept for 14 days, weekly recovery points will be kept for two months and monthly recovery points kept for seven years.
- Any file that has not been accessed in the last thirty days will be removed from primary storage and will need to be accessed from the archive
This is easily addressed in a SnapVault environment through the use of the following
- Daily backups are transferred from the primary system to the SnapVault repository
- Daily recovery points (snapshots) are kept on both the primary storage system and the SnapVault repository for 14 days
- Weekly recovery points (snapshots) are kept only on the SnapVault repository
- Monthly recovery points (snapshots) are kept only on the SnapVault repository
- A simple shell script/batch file is executed after each successful daily backup which deletes any file from the primary volume that has not been accessed in thirty days
- Users are allocated a drive mapping to their replicated directories on the SnapVault destination.
- Optionally the Primary systems and SnapVault repository may be indexed by an application such as the Kazeon IS1200, or Google enterprise search.
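The retention side of this policy (daily points for 14 days, weekly points for two months, monthly points for seven years) can be sketched as a simple keep-or-expire rule. This is only an illustration of the policy logic, not how SnapVault schedules are configured. The function name `should_keep`, the choice of Sunday as the weekly point, the 1st of the month as the monthly point, and the day counts used for "two months" and "seven years" are all assumptions made for the example.

```python
from datetime import date

DAILY_DAYS = 14          # daily recovery points kept for 14 days
WEEKLY_DAYS = 61         # "two months", approximated as 61 days (assumption)
MONTHLY_DAYS = 7 * 365   # "seven years", ignoring leap days (assumption)


def should_keep(snapshot_day: date, today: date) -> bool:
    """Return True if the recovery point taken on snapshot_day is still retained."""
    age = (today - snapshot_day).days
    if age < 0:
        raise ValueError("snapshot is dated in the future")
    if age <= DAILY_DAYS:
        return True
    # Assumption: the Sunday snapshot doubles as the weekly recovery point.
    if snapshot_day.weekday() == 6 and age <= WEEKLY_DAYS:
        return True
    # Assumption: the snapshot taken on the 1st doubles as the monthly point.
    if snapshot_day.day == 1 and age <= MONTHLY_DAYS:
        return True
    return False
```

Running this over the list of existing recovery point dates each day gives the set to expire, which would then be applied only on the SnapVault repository for anything older than the 14 daily points.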
Users then need to be informed that old files will be deleted after thirty days, and that they can access backups of their data, including the files that have been deleted from primary storage, by looking through the drive that is mapped to the SnapVault repository, or optionally via the enterprise search engine’s user access tools.
By removing the files from primary storage, instead of the traditional “stub” approach favoured by many archive vendors, the overall performance of the system will be improved by reducing the metadata load, and users will be able to more easily find active files by having fewer files and directories on the primary systems.
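The "simple shell script/batch file" mentioned in the policy could look something like the following minimal Python sketch. It is not the script I actually use; the function names are hypothetical, the age threshold is left as a parameter, and it assumes the file system really does update last-access times (many systems are mounted with access time updates reduced or disabled, in which case this approach will misjudge files). It should only ever be run after the daily backup has completed successfully.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return files under root whose last access time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_atime < cutoff:
                    stale.append(path)
            except OSError:
                # File vanished or is unreadable; leave it alone.
                continue
    return sorted(stale)


def delete_stale_files(root, max_age_days, dry_run=True, now=None):
    """Delete stale files; with dry_run=True only report what would be removed."""
    stale = find_stale_files(root, max_age_days, now=now)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Defaulting to `dry_run=True` is deliberate: the list of candidates can be reviewed, or checked against the SnapVault copy, before anything is actually removed from primary storage.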
Conclusion
Many organisations’ archiving requirements can be met by simply adding additional SATA disk to the current production system, replicated via SnapMirror to the current DR system, rather than managing separate archive platforms.
This architecture provides flexibility and scalability over time and reduces management overhead. Tape can also be used for additional backup and longer term storage if required. SnapLock provides the non-modifiable WORM like capability required of an archive without additional hardware (a software licensable feature, see more detail at http://www.netapp.com/us/products/protection-software/snaplock.html ).
Some Thoughts on Bit Rot.
During some recent discussions on Twitter, the subject of disk drive rebuild times for very large drives in excess of 10TB raised the subject of unrecoverable read errors, also known as UER, which are sometimes blamed on something called “bit rot”. However, two NetApp sponsored studies show that bit rot is far less of a problem for storage array reliability than many other factors.
The best publicly available data I’ve found on bit rot and its impact compared to other causes is contained in “A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)” by Jon G. Elerath and Michael Pecht, IEEE TRANSACTIONS ON COMPUTERS, VOL. 58, NO. 3, MARCH 2009 (http://media.netapp.com/documents/rp-0046.pdf). The following information summarizes and paraphrases the information found in that document.
What Bit rot is and why you should care
Bit rot is a concern for two main reasons. For the home user with no RAID protection, it results in the inconvenience of a lost or corrupted file, or possibly a machine that won’t boot. For the enterprise user, bit rot raises the specter not just of a lost or corrupted file, but of the potential to completely lose an entire RAID group after the failure of a single drive due to the “Media Error on Data Reconstruct” problem. The less catastrophic issue is far less of a concern on an enterprise class array, because the additional error detection and correction available through the use of RAID and block level checksums means the chances of bit rot causing the loss or corruption of a file are vanishingly remote.
What I believe most people mean by bit rot could be more accurately described as latent media errors rather than “bit rot”, which is more strictly caused by degradation of the magnetic properties of the media.
The reason for this is that most early RAID reliability models assumed that data will remain undestroyed except by “bit rot”. Although it is correct that the magnetic properties of the media can degrade, this failure mechanism is not a significant cause. Data can become corrupted any time the disks are spinning, even when data are not being written to or read from the disk. The failure mechanisms outlined below are not unknown, but neither are they readily available from HDD manufacturers.
Common Causes for losing data
Four common causes for losing data after it’s been correctly written are thermal asperities, scratches, smears, and corrosion.
- Thermal asperities are instances of high heat for a short duration caused by head-disk contact. This is usually the result of heads hitting small “bumps” created by particles embedded in the media surface during the manufacturing process. The heat generated on a single contact may not be sufficient to thermally erase data but may be sufficient after many contacts.
- Although disk heads are designed to push particles away, contaminants can still become lodged between the head and disk. Hard particles used in the manufacture of an HDD can cause surface scratches and data erasure any time the disk is rotating.
- Other “soft” materials such as stainless steel can come from assembly tooling. Soft particles tend to smear across the surface of the media, rendering the data unreadable.
- Corrosion, although carefully controlled, can also cause data erasure and may be accelerated by thermal asperity generated heat.
Why data is sometimes not there in the first place
A latent defect can also be caused by data that was incorrectly or incompletely written to the disk in the first place. This can happen because of the inherent “Bit Error Rate” or BER, writing to damaged media, or too much lubrication and “high-fly writes”.
- The bit error rate (BER) is a statistical measure of the effectiveness of all the electrical, mechanical, magnetic, and firmware control systems working together to write (or read) data. Most bit errors occur on a read command and are corrected, but since written data are rarely checked immediately after writing, bit errors can also occur during writes.
- BER accounts for a fraction of defective data written to the HDD, but a greater source of errors is the magnetic recording media that coats the disks. Writing on scratched, smeared, or pitted media can result in corrupted data. The reasons for scratches and smears were covered earlier, however “pits and voids” are caused by particles that were originally embedded in the media during the manufacturing process and subsequently dislodged during the polishing process or field use.
- The final common cause for poorly written data is the “high-fly write.” The heads are aerodynamically designed to have a negative pressure and maintain the small, fixed distance above the disk surface at all times. If the aerodynamics are disturbed, the head can fly too high, resulting in weakly (magnetically) written data that cannot be read. In addition to “wind gusts” inside the disk, all disks have a very thin film of lubricant on them to help protect against head-disk contact. While this lubrication helps mitigate the effects of “thermal asperities”, lubrication build-up on the head can increase the flying height, resulting in weak or incomplete writes.
Where’s my data ?
Finally, all the data may have been written correctly, but the disk may not be able to “find” it because of damage to the special “servo” tracks which help keep the heads correctly aligned to the data on the disk. In other cases the servo tracks are undamaged, but wear and tear on the motor and disk head bearings, noise, vibration and other electromechanical errors cause the head positioning to take too long to lock onto a track, which ultimately also causes “latent block errors”.
How to protect yourself
There are two main ways of dealing with these kinds of latent block errors. The first is to perform disk scrubs, which is something every reputable array vendor does. The problem, however, is that as disk sizes get larger and larger, a full disk scrub can take too long for the protection to be as effective as it should be. The other method is to use additional levels of RAID protection such as RAID-6, which allows for higher levels of resiliency and error correction in the event of hitting a latent block error when reconstructing a RAID set. NetApp uses both approaches, as studies have shown that the risk of losing data through these kinds of events is thousands of times higher than predicted by most simple “MTBF” failure models.
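To get a feel for the two numbers in play here, the following back-of-the-envelope sketch estimates how long a full scrub of a large drive takes, and the chance of hitting at least one unreadable sector while reading every surviving drive during a reconstruct. The drive sizes, the 100 MB/s scrub rate and the 1-in-10^14 bits error rate are illustrative assumptions, not figures from the paper. The model also treats every bit as an independent trial, which is exactly the kind of simple model that ignores the latent defect mechanisms described above, so it is an illustration rather than a prediction.

```python
import math


def scrub_hours(capacity_tb, throughput_mb_s):
    """Hours needed to read an entire drive once at a sustained scrub rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal TB to MB
    return capacity_mb / throughput_mb_s / 3600


def p_read_error(bytes_read, errors_per_bit):
    """Probability of at least one unrecoverable error while reading bytes_read.

    Simplification: every bit is treated as an independent trial.
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p
    return -math.expm1(bits * math.log1p(-errors_per_bit))


# Assumed example: a 10 TB drive scrubbed at 100 MB/s
hours = scrub_hours(10, 100)

# Assumed example: reconstruct reads six surviving 2 TB drives at 1e-14 errors per bit
p_fail = p_read_error(6 * 2 * 10**12, 1e-14)
```

With these assumed inputs a single scrub pass takes a bit over a day of uninterrupted reading, and the single parity reconstruct has a better than even chance of hitting an unreadable sector, which is the situation a second level of parity such as RAID-6 is there to survive.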
A Public Apology
This is an edited copy of a comment I posted on storagezilla’s blog (the second of two which said more or less the same thing, as the comment moderation there doesn’t seem to indicate whether the comment is posted or queued). There was a secondary request to EMC which dilutes the essence of the apology, and on reflection and feedback from others I’ve decided to remove it.
You’re right, it was 62 vBlock accounts not 62 vBlocks sold as I tweeted
“1 Year in and just over 60 V-Blocks sold. I’ll wager that there will be many more FlexPod deployments in 12 months time”
and
“@DanMoz 63 in production or deployment according to the figures I saw. But you are right, the concept has been sold well.”
Thanks for pointing out the inaccuracy of these statements (though implying that I’m dumb or a liar is a little harsh), and I fully recant/withdraw the comment and apologise for the dumb error, both here, on twitter where I made the statement, and on my own blog. I will be more careful in the future.
<Request removed .. JM >
Let the truth prevail.
Regards John Martin
@life_no_borders
On a third reading of this, even my apology was inaccurate, I claimed 63 vBlocks had been sold, not 62 … d’oh ! Time to get more sleep and up my game.
Some quick thoughts about backup
This is a summary I wrote for someone else, not my usual blog entry, however it does encapsulate my thoughts around the benefits of NetApp’s implementation of replication based backup. I’ll try to get to a more technically focussed version soon.
Exponential increases in data combined with increased storage density techniques mean that traditional “Bulk Copy” based methods of backup are no longer able to address the growing backup challenges of a modern IT environment. Even backup architectures which are based on an “incremental forever” basis may find that the time it takes to move whole files from remote locations over slow WAN links hits scalability limits as the amount of data at the remote sites increases.
Ultimately, the most promising technology to resolve this involves replicating only changed data blocks from primary data sources to secondary arrays in other physical locations. These secondary arrays are equipped with high-density, low cost disk drives to provide the large amounts of raw capacity in the densest possible footprint. Once the data has been sent to the remote array, the solution can then perform various kinds of data manipulation to store multiple recovery points within a small data storage footprint. This class of technology generally requires that the primary storage is re-hosted on intelligent storage arrays, or that new agents and secondary storage arrays and backup systems are implemented that support this advanced functionality.
This has the advantage that only changed blocks will be moved from primary storage through to the secondary storage on the NearStore. This both reduces the amount of storage that needs to be provisioned and allows the data to be sent over low bandwidth, high latency network connections. Because of this, the secondary copies of the data can be stored automatically in an offsite location without needing a second step, as is commonly required with tape backups.
With robust data storage architectures, multiple logical data points, and offsite copies, these technologies can provide almost all the benefits of a tape based backup solution.
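The difference between moving whole files and moving only changed blocks can be shown with a small sketch. Hash comparison of fixed-size blocks is used here only because it is easy to show in a few lines; it is not meant to describe how SnapVault identifies changed blocks. The 4 KB block size and all function names are assumptions for the illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(previous, current, block_size=BLOCK_SIZE):
    """Return {block_index: bytes} for blocks of current that differ from previous."""
    old = block_digests(previous, block_size)
    changes = {}
    for index, offset in enumerate(range(0, len(current), block_size)):
        block = current[offset:offset + block_size]
        if index >= len(old) or hashlib.sha256(block).digest() != old[index]:
            changes[index] = block
    return changes


def apply_changes(previous, changes, new_length, block_size=BLOCK_SIZE):
    """Rebuild the current image on the secondary from the old image plus changes."""
    image = bytearray(previous[:new_length].ljust(new_length, b"\0"))
    for index, block in changes.items():
        start = index * block_size
        image[start:start + len(block)] = block
    return bytes(image)
```

For a file of four blocks where one byte changes, `changed_blocks` returns a single 4 KB block to send over the WAN instead of the whole file, and `apply_changes` on the secondary reproduces the new version while the previous version can be kept as an earlier recovery point.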
Customer Backup Considerations
Proven Scalability
While a number of companies have been changing their traditional backup engines to leverage the benefits of replication based storage, NetApp pioneered this technology with the release of the industry’s first ATA based backup to disk appliance in 2002. Since then NetApp has deployed this technology in thousands of locations world-wide, many of which protect critical data estates measured in Petabytes.
Centralised Policy Based Protection
The advanced data protection capabilities provided by NetApp require a correspondingly advanced set of management methodologies and tools to fully exploit the benefits of replication based data protection. Protection Manager was designed for replication based backup, and provides an integrated way of managing both backup and disaster recovery in a single pane of glass through the following functions and features
- Discovery. Detects new volumes not protected and presents as “unprotected data” in the Protection Manager UI.
- Policy creation. Creates policies for data protection in a wizard-driven graphical process and then calls lower level NetApp tools for execution of the replication process
- Monitoring. Monitors the whole replication process, watching the capacity and performance against policy, and ensures that protection policies are not out of compliance
- Visualization. Provides discovery and mapping views including drilldown and management by exception
- Reporting. Offers status and health reporting such as a “data transfer report” to identify transfer amount, performance metrics, and duration of transfer for replication processes
- Virtual machine support. Support through Open Systems SnapVault includes VMware ESX, Microsoft Hyper-V, and Citrix XEN
- Application integration. Integrates with SharePoint, SQL Server, Microsoft Exchange, Oracle, and SAP via NetApp SnapManager
- DR task automation. Automates tasks, leverages templates, and provides ongoing monitoring with subsequent reporting to those in authority
- DR readiness. Monitors resources for changes that could compromise a disaster recovery and proactively communicates them to administrators for remediation
- One-button failover. Provides continued data access to users, even in the event of a disaster
Usable Copies
Unlike typical backup applications, SnapVault always keeps the data in its original usable format, which can be accessed by open industry standard protocols and methods. Files can be accessed using CIFS, NFS or HTTP, and LUNs can be accessed by iSCSI or Fibre Channel, all without having to restore the data back to the original location (which may destroy good data), or find alternate space to recover the file / data object.
Ease of Restore
Usable copies also provide a self service restore capability that reduces recovery times (RTO), decreases helpdesk calls, and increases end user faith in the backup process. This in turn reduces counterproductive end user driven backup strategies and reduces both infrastructure and business costs. Usable copies also allow backups to be verified for correctness, and provide easy ways of performing deep content searching of backups for legal and other data discovery requests.
Tape Integration
Because the backup data remains in the same format used for traditional primary storage, the high speed NDMP based dump and mirror to tape options used by thousands of companies around the world to protect their NetApp primary storage can also be used to protect the backup copies. These long term archival copies can be sent to tape under the control of traditional backup systems such as NetBackup, TSM or CommVault, leveraging existing knowhow and infrastructure while minimising costs associated with tape management and off siting.
Data Storage for ECM – Part 1 – Unified Storage
The last few months have been interesting for me, as my new job role involves a lot of work with alliance partners, many of whom either didn’t know anything about NetApp, or where they did know something it was along the lines of “Oh yeah, the NAS company”. In many respects, it’s a lot easier to explain what we do when someone has an open mind, as pre-conceived notions are often hard to budge, and telling someone they’re wrong is rarely a good way to start a trusted relationship. Even though I report up through our local director of marketing, my soul is still that of an engineer, so when it comes to describing what NetApp does, and why that’s important, I tend to go straight to “Well, we still sell NAS, and that’s a big part of what we do, but what we really sell is Unified Storage”, at which point I expect to see the “and I should care about this because …” look.
I’ve been seeing this look quite a bit recently, mostly because many of the people I speak to also get briefs from other storage vendors, and they too have suddenly started talking about “Unified Storage” without really understanding it or explaining its relevance to datacenter transformation. A good case in point was the opportunity I had to speak at the local VMware seminar series where I shared a stage with VMware, Cisco, and EMC. All of us got our 7.5 minutes to explain how we helped accelerate our customers’ journey to the cloud. VMware went first, followed by EMC, then Cisco and then me.
I’d prepared two slides for my 7 minutes focussing on our key differentiators, Unified storage, tight VMware integration with advanced storage features, deduplication and storage efficiency, Secure Multi-Tenancy, Cisco validated designs, Backup and recovery, and waited happily to see if EMC would come out with their usual pitch.
Boy, was I surprised … EMC’s pitch was Unified storage, deduplication, tight VMware integration with advanced storage features, deduplication and storage efficiency, security and Cisco validated designs … what the ????, had I suddenly slipped into a parallel universe ? Had EMC, a company fairly well known for pushing seven different kinds of storage with forklift upgrades suddenly capitulated and acknowledged that the approach NetApp had been pushing for so many years was actually right ? Was Chuck Hollis about to come on stage and apologise for blatant manipulation of social media and comment filtering ?
Now while I could have picked holes in their story by pointing out that at least from a VMware perspective they don’t have deduplication, that their advanced integration with VAAI hadn’t been released, there was no Cisco Validated Design for vBlock, and that the RSA stuff had no integration at the storage layer, nobody is really interested in hearing vendors denigrate each other, and I only had 7 minutes to figure out how to show our unique ability to help customers in the face of the most shameless “me too” campaign I’ve ever seen. During that 7 minutes there was one thing that really struck me. EMC has no real concept of why unified storage is important. Their concept of unified storage was something that allowed connection by Fibre Channel, iSCSI, CIFS and NFS and had a nice GUI. Having worked at NetApp for a number of years, I was surprised at how they’d missed the point completely. Almost everyone at NetApp knows that these are good features to have (we’ve had them for over 10 years now), but we also know that by themselves, they have only limited benefits. I’ve had a little while now to think about this, and it’s become clear to me that for other vendors, Unified storage is not a strategic direction, but a tactical response to NetApp’s continued success in gaining market share. This becomes even more obvious by taking a look at their storage portfolios
| | NetApp | EMC | Dell | HP | IBM | HDS |
|---|---|---|---|---|---|---|
| Entry Level NAS / NAS Gateway | FAS | Iomega | Windows Storage Server | Windows Storage Server | N-Series (OEM) | Windows Storage Server |
| Entry Level SAN / iSCSI | FAS | Celerra NX | Equallogic | MSA, Lefthand | N-Series (OEM), DS3xxx (OEM) | AMS |
| MidRange NAS | FAS | Celerra NS | Celerra (OEM) | Polyserve ? | N-Series (OEM) | BlueArc (OEM) |
| Mid-Range SAN | FAS | CLARiiON | Equallogic, Celerra (OEM) | Lefthand, EVA, 3PAR | DS4xxx/5xxx (OEM), N-Series (OEM), XIV | AMS |
| Archive & Compliance | FAS | Centera | Centera (OEM) | HP RISS | FAS | HCP |
| Backup to disk platform | FAS | DataDomain, Avamar | | HP VTL, HP Sepaton (OEM) | Diligent VTL | Diligent (OEM) |
| Storage Virtualisation Gateway | FAS (V-Series) | Invista, VPLEX | | | SVC, N-Series Gateway (OEM) | USP-V |
| Object Repository | StorageGRID | Atmos | | StorageGRID (OEM) | StorageGRID (OEM) | HCP |
| High End / Scale Out | FAS / FAS (C-Mode) | V-Max | V-Max (OEM) | USP-V (OEM) | DS6xxx/8xxx, XIV | USP-V |
| Mainframe | N/A | V-Max | V-Max (OEM) | USP-V (OEM) | DS8xxx, XIV | USP-V |
Now, if you match one of those arrays against the workload it was designed for, you’ll probably get a pretty good result. In a static, reasonably predictable environment without much change, you could make a reasonable argument that this was the best approach to take. You built a silo around an application or function, and purchased the equipment that matched that function. I’ve seen more than one customer that had every product in a vendor’s portfolio, and seemed to be fairly happy, or at least had been until fairly recently.
The problem with these narrow siloed approaches is that each silo creates new inefficiencies and dedicated areas of whitespace in both capacity and performance. For example, there is no way of taking excess capacity allocated to a backup to disk appliance and using it for CIFS home directories, nor is there a way of taking the excess IOPS capability of a temporarily idle disk archive and allocating those IOPS to another application undergoing an unusual workload spike such as a VDI bootstorm.
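The capacity side of this whitespace problem is simple arithmetic, sketched below with entirely hypothetical numbers. The silo names, sizes and the function name `whitespace_report` are made up for the illustration; the point is only that free space trapped in separate silos cannot be added together, whereas free space in a shared pool can.

```python
def whitespace_report(silos, request_tb):
    """silos maps a name to (capacity_tb, used_tb); returns free space and fit checks."""
    free = {name: capacity - used for name, (capacity, used) in silos.items()}
    return {
        "free_per_silo": free,
        "total_free": sum(free.values()),
        "fits_in_one_silo": any(space >= request_tb for space in free.values()),
        "fits_in_shared_pool": sum(free.values()) >= request_tb,
    }


# Hypothetical figures, in TB: (capacity, used)
silos = {
    "backup_to_disk": (50, 35),
    "archive": (40, 28),
    "file_serving": (30, 27),
}
report = whitespace_report(silos, request_tb=20)
```

In this made-up case there is 30 TB of free capacity in total, yet a new 20 TB requirement does not fit in any single silo, so more storage has to be bought even though plenty is sitting idle.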
But for me, the biggest area of waste is that of management. Each of these silo’s tends to get its own set of administrators and workflows, each of which may, or may not work in harmony with the other. Most of us have experienced the bitterness and waste of IT turf wars, and the traditional vendors not only encourage, but depend on and help maintainin these functionality silos, as it allows a divide and conquer sales model that benefits the vendor far more than the customer. If there was a book entitled “How to build an inflexible and wasteful IT infrastructure”, I imagine that encouraging and spreading “IT functionality silos” would fill up the first few chapters. Even though there are a bunch of people who have been quite happy with this status quo, and the business processes from budgeting and product selection all the way through to procurement and training that entrenches this model, things are changing, and they’re changing a lot faster than I thought they would.
A lot of credit for this change has to go to vendors like Microsoft, Cisco and VMware whose products have blurred the lines of these traditional silos. Virtualisation at both the compute and network layers has driven the kinds of cross-functional change CIOs have been crying out for, and in the wake of this, Unified storage finds its natural fit; not because of its support for multiple protocols, but because these environments require the kind of workload agility and management simplicity that only a truly Unified storage offering can fully satisfy.
But it’s not just server and desktop virtualisation and other forms of shared infrastructure where unified storage is a natural fit. Almost any “multi-part” or landscape style application can benefit too, not just because of the flexibility and efficiency, but more importantly because these environments are really hard to protect effectively. A really good example of this is Enterprise Content Management applications such as FileNet and SharePoint.
Typically these applications have
- Content Servers
- Business Process Workflow Engines / Servers
- Database servers
- Index Servers
In a large installation, there will be many of these servers and multiple databases, indexes and content repositories to cater for scalability and in some cases, the tyranny of distance (latency is forever).
In a “traditional” siloed model this data would be stored on two or possibly three different kinds of storage arrays, each with its own method of backup and replication, most of which depend on some form of “bulk copy” backup method as the primary form of logical data protection. The effect of this is that backing up these ECM systems on traditional storage architectures is almost impossible. While I’ve been talking to customers about this for a few years, recently there seems to have been a big increase in customers seeing these problems. In one case a design review for a petabyte-scale SharePoint implementation identified that if a critical index was lost the entire infrastructure would be effectively unrecoverable, and that there was no effective backup capability. In another discussion I had today around redesigning data protection, a brief mention of ECM created more interest than almost anything else, simply because of the difficulty of backing up their Documentum system.
Truly unified storage not only allows data to be stored using multiple protocols, and provides rich functionality like deduplication and compliant WORM storage which makes it a logical choice for ECM solutions, but more importantly it also provides a single integrated method for protecting that data in a way that is application-consistent without the need for a “cold” backup. And, you guessed it, NetApp can do that with ease, whereas other vendors’ versions of what they are calling unified storage would find that challenging (to be kind).
In the next few posts, I’ll take a deeper dive into exactly what NetApp does for Enterprise Content Management, with a focus on why Unified Storage is such a good match, and what we do to protect a company’s most important data assets.
There but for the grace of god ..
There are few incidents that can truly be called disasters; things like the World Trade Center bombing, the Boxing Day tsunami, and the Ash Wednesday bushfires. Whether or not you think a major failure in IT infrastructure can be called a disaster just like those true tragedies, the recent and very public failure of the EMC storage infrastructure in the state of Virginia is the kind of event none of us should wish on anyone.
While we all like to see the mighty taken down a peg or two, there’s a little too much schadenfreude over this incident from the storage community for my taste. Most of us have had at least one incident in our careers that we are very glad never got the coverage this one has, and my heart goes out to everyone involved… “There, but for the grace of God, go I …”
I must say though, I’m a little surprised by what appears to be a finger pointing exercise with a focus on operator error, even though it would confirm my belief that Mean Time Before Cock-up (MTBC) is a more important metric than Mean Time Between Failures (MTBF). Based on the nature of the outage and some subsequent reports, it looks like there was a failure of not just one but two memory cache boards in the DMX3. If so I’d have to agree with statements I’ve seen in the press saying that this is incredibly unlikely or even unheard of. In a FAS array this would be equivalent to an NVRAM failure followed by a subsequent NVRAM failure during takeover, though even then, at worst, the array would be back up with no loss of data consistency within a few hours. Having said that, the chances of either of these kinds of double failure events happening are almost unimaginable, but certainly, as recently shown, not impossible. How “an operator [using] an out-of-date procedure to execute a routine service operation during a planned outage” could cause that kind of double failure is kind of beyond me, and has changed my opinion on the DMX’s supposedly rock solid architecture.
What I believe happened was not a failure of EMC engineering (which I highly respect), or even a failure of the poor tech who followed the “outdated procedure”, but rather it was a “failure of imagination”.
In this case the unimaginable happened: a critical component that “never failed” did fail, something which provides a valuable lesson to everyone who builds, operates and funds mission critical IT infrastructure. Regardless of whether the problem was caused by faulty hardware, tired technicians, or stray meteorites, there really is no substitute for defense in depth. As a customer / integrator that means
- redundant components
- redundant copies of data on both local and remote hardware
- well rehearsed D/R plans
It doesn’t matter what the MTBF figures say, you have to design on the assumption that something will fail and then do everything within your time, skill and budget to mitigate that failure. If there are still exposures, then everyone who is at risk from those exposures needs to be aware of them and of what risks you’re taking on their behalf. We wouldn’t expect anything less from our doctors, and I don’t see why we shouldn’t hold ourselves to that same high standard.
As vendors it’s our responsibility to make features like snapshots, mirroring, and rapid recovery affordable and easy to use, and to do everything we can to encourage our customers to implement them effectively. From my perspective NetApp does a good job of this, and that’s one of the reasons I like working there.
As more infrastructure gets moved into external clouds, I think it’s inevitable we’re going to hear a lot more about incidents like this as they become more public in their impact. Practices that were OK in the 1990s no longer work in large publicly hosted infrastructures, where many of the old assumptions about deploying infrastructure don’t hold true.
Hopefully everyone responsible for this kind of multi-tenant infrastructure is reviewing their deployment to make sure they’re not going to be next week’s front page news.
Data Storage for VDI – Part 10 – Megacaches
Megacaches
More recently a range of products has come to the market that take advantage of the increasing affordability of non-volatile memory (particularly SLC Flash) to create caching architectures that change the rules for modular storage (in no particular order)
- PAM-II / FlashCache
- Sun 7000 Logzilla and Readzilla
- FalconStor Flash SAN accelerator (using flash modules from Violin)
- IBM EasyTier / Something to do with SVC
- EMC FAST Cache
- Atlantis Computing vScaler
- Nimble Storage
- Lots more to come …
While I’d love to go into the details of each of these and compare the features and benefits of each technology, a lack of time and detailed information makes this really hard to do. Also, as a general principle, I don’t think that it’s wise for an employee of one vendor to make a lot of assertions about another vendor’s technology. I have enough trouble keeping up with what’s happening at NetApp without trying to gain deep subject matter expertise with, for example, HP or EMC’s technology. Having said that, I do think contrasting two different approaches can be useful, so for that reason I’ve decided to deviate from that principle, and will compare as diligently as I’m able NetApp’s FlashCache and EMC’s FAST Cache.
I’ve included FlashCache for obvious reasons: there is already more than a Petabyte of it out there, I’ve been analyzing it for about a year now, and I have access to the engineering documentation. I chose FAST Cache because being an EMC product means the marketing engine behind it will make it widely known and the engineering will be solid. The market presence and differing approaches of both of these technologies make them a fairly good yardstick against which the other mega-cache technologies will compare themselves.
Part of the reason I took so long to write this post was that I spent a fair time trying to characterise the likely performance benefits of a FAST Cache solution, which as a competitor is a fairly dangerous exercise. I’ve tried to be even handed and fact based when doing this and have disclosed where possible the sources of my information. However, if you believe I’ve misrepresented the technology please let me know. This is not about vendor bashing, it’s about establishing what I hope is a fair basis of comparison.
Doing this in an even handed fashion was particularly hard because a lot of the publicly available information is either incomplete or somewhat contradictory. I know that is an industry-wide problem, but this is one area where there seems to be a lot more marketing material than engineering substance. The main sources of my information were blog posts by EMC employees and integrators as well as an official EMC technical report, the details of which, and my takeaways from them, are as follows.
How Fast is FAST for VDI ?
Chad Sakac is quoted here in relation to the speed of writing data in various RAID configurations: “(I’ve added SSD with 6000 IOps as commented by Chad Sakac).” While I respect Chad’s comments to do with EMC’s integration with VMware, I think he might be a little off here, especially given that this comment was made 6 months ago, long before FAST Cache was announced.
Mark Twomey (StorageZilla) says that EFDs have no additional benefit for writes (I assume this applies mostly to Symmetrix, which already does good write optimisation). He is quoted here, where he says “The thing most people don’t understand about Flash is that writes aren’t really all that much faster to a good SSD than they are to a regular disk drive. And thus, predicting where writes are going isn’t an objective of FAST”
Randy Loeschner, who also works for EMC and seems to know his way around a database, says on his blog “Solid State/Enterprise Flash Drives are similar in Write Performance to 15K Fibre Channel disks, but in READ scenarios are capable of 2500 or more READ IOPs.”
I also checked any available benchmarks and found an EMC document that contained reasonably useful data, though even that seemed to contradict itself with regard to the number of IOPS you could get out of an EFD. It says that an Enterprise Flash Drive (EFD) can get 2,500 IOPS per drive, though without any details as to the latency or the I/O mix. Then further down it says that in a 50:50 read:write 8K environment you can get 1057 IOPS per EFD at 12ms response time for reads and 24ms response time for writes without any additional help from the CLARiiON DRAM based write cache, or 1760 IOPS per EFD at 6ms response time for reads and 2ms response time for writes when the write cache is enabled.
I also found another informative post here at gotitsolutions.org, which shows roughly 1100, 1500, and 2000 IOPS per drive for 100% random writes, 60:40 write:read and 40:60 write:read performance respectively, without help from DRAM caching. Furthermore, from a conversation I had with a colleague whose opinion I respect, it appears that “FAST Cache does 64K blocks …[which means that EMC] claim 50% more speed overall.”
Based on the above information, I think it would be reasonable to assume that a 6+1 EFD RAID group configured as FAST Cache would allow for between 12,000 and 20,000 sub-5ms IOPS depending on the configuration and workload. That’s pretty good, but it’s not the “orders of magnitude” faster than spinning disk so often claimed, and nowhere near the performance of array cache.
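As a rough sanity check on that range, the per-drive figures quoted above can be multiplied out across the seven drives in the group. This is only a sketch: it assumes IOPS add up linearly across drives and ignores RAID parity overhead and controller limits, and the names in it are made up for illustration rather than taken from any EMC document.

```python
# Per-drive EFD IOPS figures quoted in the sources above.
PER_DRIVE_IOPS = {
    "50:50 8K, write cache off": 1057,
    "50:50 8K, write cache on": 1760,
    "40:60 write:read, no DRAM cache": 2000,
    "headline per-drive figure": 2500,
}

def aggregate_iops(per_drive_iops, drives=7):
    """Assumption: IOPS scale linearly with drive count (no parity or controller limits)."""
    return per_drive_iops * drives

estimates = {name: aggregate_iops(iops) for name, iops in PER_DRIVE_IOPS.items()}
for name, total in estimates.items():
    print(f"{name}: {total} IOPS")
```

Under those assumptions the cached and read-heavy figures come out between roughly 12,000 and 17,500 IOPS, which sits inside the range above, while the uncached 50:50 figure falls below it.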
The benefits of a write mega-cache
The write load in our hypothetical 1000 user, 12 IOPS per user, 33:63 R:W VDI environment equates to about 30MiB/sec of random write activity or about 108 GiB per hour. A 6+1 RAID group of 146GB EFD drives provides about 822 GiB of usable cache space. If you split this 50:50 between read and write, this works out to about 4 hours of writes before you even begin to need to destage. What differentiates a mega-cache from a standard cache is that it can absorb a sufficiently large number of changes to satisfy hours or possibly even entire business days’ worth of I/O. In addition a cache this large is almost certainly going to improve the efficiency with which writes can go to the back end RAID group. The extent to which it does this is dependent on many different factors. In some edge cases the additional improvement is marginal, in others it could be close to the kinds of efficiencies typically seen in a NetApp FAS array. In theory a 6+1 RAID-5 disk set combined with a large write cache could approach or even exceed the write efficiency of a 6+6 RAID-10 disk set.
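The "hours of writes before destage" figure can be reproduced with a few lines of arithmetic. The sketch below only uses the numbers stated in the paragraph above (822 GiB of usable cache, a 50:50 split, about 30 MiB/sec of writes); the function name is illustrative.

```python
MIB_PER_GIB = 1024

def hours_of_writes(cache_gib, write_share, write_mib_per_sec):
    """Hours of sustained writes the write portion of the cache can absorb."""
    write_cache_gib = cache_gib * write_share
    gib_per_hour = write_mib_per_sec * 3600 / MIB_PER_GIB
    return write_cache_gib / gib_per_hour

# 822 GiB usable, half reserved for writes, roughly 30 MiB/sec of random writes.
hours = hours_of_writes(cache_gib=822, write_share=0.5, write_mib_per_sec=30)
print(f"{hours:.1f} hours of writes before destaging is needed")
```

Using strict binary units the hourly write volume comes out nearer 105 GiB than 108 GiB, which does not change the conclusion: the write half of the cache holds a little under four hours of writes.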
The benefits of a read mega-cache
On the read side of the equation, mega-caches in the order of 250GiB+ have the advantage that they are able to store the majority of the active working set, especially in VDI environments where it is not unusual to see it offloading 80+% of the read I/O from the disks. This not only improves the latency of the I/Os from cache but also those that need to come from disk. The disk improvements come from reduced I/O contention, and the ability to make read-ahead more effective as detection of the read pattern which triggers the read-ahead functions happens while the data is being served from cache. It also allows the read-ahead algorithms to be more aggressive as the potential risks of reading in too much data and flushing out other useful data is mitigated by the much larger read caches.
The Net-Net is that mega-caches can significantly reduce the average latency for disk I/O even in spindle constrained environments, and the ability to handle peak loads is significantly improved.
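To see why offloading 80% of reads matters so much, a simple weighted average is enough. The latencies below are hypothetical values picked purely for illustration, not measurements from any array, and the model deliberately keeps disk latency constant even though reduced contention would normally improve it as well.

```python
def average_read_latency_ms(hit_ratio, cache_latency_ms, disk_latency_ms):
    """Weighted average of cache hits and disk reads."""
    return hit_ratio * cache_latency_ms + (1.0 - hit_ratio) * disk_latency_ms

# Hypothetical latencies, for illustration only.
CACHE_MS = 1.0
DISK_MS = 10.0

without_cache = average_read_latency_ms(0.0, CACHE_MS, DISK_MS)
with_cache = average_read_latency_ms(0.8, CACHE_MS, DISK_MS)
print(f"no cache hits: {without_cache:.1f} ms, 80% cache hits: {with_cache:.1f} ms")
```

With these made-up numbers the average read latency drops from 10 ms to about 2.8 ms, before counting any improvement on the disk side.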
FlashCache
NetApp really stoked the market for mega-caches when it released the PAM-II, now called FlashCache (I’m kind of sad they changed the name, there were lots of bad PAM puns like “Flash in the PAM” that few if any will now remember). Unlike SSD/EFD based cache architectures, FlashCache connects to the storage controller via PCIe, includes a NetApp-created flash translation layer and some dedicated hardware acceleration, and uses a driver which is tuned to the characteristics of all of this hardware. All of this results in a cache which is capable of hundreds of thousands of sub-2ms IOPS with shorter code paths and higher levels of CPU efficiency than is seen in SSD/EFD based caches.
Another thing that helps is cache awareness of FAS Deduplication and FlexClones, which in effect multiplies the effective size of the cache by the level of deduplication within the active dataset. For example if you are using deduplication for persistent desktop guest O/S images and seeing 95% deduplication ratios (especially for the 2GB core operating system portion of the image), your effective cache size is 20x larger. This means that even a modest FAS2040 with 4GB of RAM can have an effective read cache of 50+GiB which comes in really handy during boot storms. For a 256GB FlashCache, using the same math, the effective cache size ends up being around 5TB! That’s a best case situation, but the strange thing about VDI on NetApp is that best case scenarios just keep coming up over and over again, which is what prompted me to start on this series of posts in the first place.
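The 20x multiplier follows directly from the deduplication ratio. Here is a minimal sketch of that math, assuming (as a best case) that the deduplication ratio applies uniformly to everything held in the cache; the function name is made up for the example.

```python
def effective_cache_size(physical_size, dedup_ratio):
    """Effective cache size when dedup_ratio of the active blocks are duplicates.

    Best-case assumption: the ratio applies uniformly to the cached working set.
    """
    if not 0 <= dedup_ratio < 1:
        raise ValueError("dedup_ratio must be at least 0 and below 1")
    return physical_size / (1 - dedup_ratio)

multiplier = effective_cache_size(1, 0.95)
flash_gb = effective_cache_size(256, 0.95)
print(f"multiplier: about {multiplier:.0f}x")
print(f"256GB FlashCache at 95% dedup: about {flash_gb / 1024:.0f}TB effective")
```

At lower deduplication ratios the multiplier falls off quickly (50% deduplication only doubles the effective size), which is why the 5TB figure should be read as a best case.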
Isn’t FlashCache just for reads ?
As good as FlashCache is, some commentators have quite correctly pointed out that this cache is read only, but they then go on to draw the incorrect conclusion that it does nothing for write performance. This might elicit a “Thank you Captain Obvious” from some, but yet again this is one of those things which, like the sun revolving around the earth, is simple, understandable, full of common sense, and also happens to be wrong.
FlashCache + Dedup/FlexClone + Realloc = High speed write cache.
If you’ve read through this entire series, you might remember the following statement
“Thus we expect to write 336 blocks in 58+16= 74 disk operations. This gives us a write IEF of 454%”
This was on the assumption that the system was about 80% full and that the best allocation area was about 40% utilised. But what happens when the best allocation area is completely unallocated? This question was already covered in the following blog post, though it would appear the author decided to take another job outside of NetApp (good luck at CommVault, Mike) and his NetApp blog may get cleaned up at some later time, so I’ve taken the liberty of taking an excerpt from it.
“For demonstration, I configured a single 3 disk NetApp aggregate (2 parity, 1 data) to demonstrate how much random write I/O I could get out of a single 1TB 7200 SATA drive .. The result is over 4600 random write IOPs with an average response time of 0.4ms. “
This was 1 SATA data drive (the other two were parity drives, which don’t add to write speeds) delivering 4600 random write IOPS. If you extrapolate this, 4 1TB SATA drives will give you about 3TB of usable storage and 18,400 IOPS. Woo Hoo! 4 SATA drives from NetApp = 7 EFD drives from EMC, game over, discussion closed .. right?
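The extrapolation is a straight multiplication, sketched below. It assumes the single-drive result scales linearly across data drives under the same ideal conditions (empty, contiguous free space), which is exactly the assumption questioned in the next few paragraphs; the EFD range is the estimate made earlier in this post.

```python
SATA_WRITE_IOPS_PER_DATA_DRIVE = 4600    # from the quoted single-drive test
EFD_CACHE_IOPS_RANGE = (12_000, 20_000)  # earlier estimate for a 6+1 EFD group

def extrapolated_write_iops(data_drives, per_drive_iops=SATA_WRITE_IOPS_PER_DATA_DRIVE):
    """Assumption: ideal-case write IOPS scale linearly with the number of data drives."""
    return data_drives * per_drive_iops

sata_iops = extrapolated_write_iops(4)
low, high = EFD_CACHE_IOPS_RANGE
print(f"4 SATA data drives: {sata_iops} random write IOPS")
print(f"within the EFD cache estimate: {low <= sata_iops <= high}")
```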
Well, yes, under ideal circumstances, but the world is not a perfect place, and neither are the datacenters which inhabit it, even with VDI, so what might stop this from working outside of the unicorn farm?
1. Those disks won’t stay empty
True, but to be equal to the amount of write cache used in our 7 disk EFD cache (assuming a 50:50 split between read and write cache) we could fill those 4 SATA drives with 2.5TB of data, and still have an equivalent write caching capability.
2. That freespace won’t stay contiguous
True, but NetApp provides methods to re-arrange the freespace via the reallocate -A command (sometimes called segment cleaning). This option is particularly well suited to VDI environments where large burst writes are fairly typical and where optimising access for sequential reads is not generally considered a high priority.
3. There will be competition for read I/O
True, but single instancing technology and smart caching allows the majority of those reads to be served from cache.
But what about the real world ?
I plan to cover each one of these in some blogs on detailed performance tuning for NetApp, but rather than delve even deeper into abstract theory, I’m going to pull some data and graphs from an existing 2000+ seat VDI deployment that uses FlashCache and Reallocate to manage some very bursty I/O patterns. The interesting thing about this particular implementation is that it is far from an “ideal” workload and shows what can be done with a little bit of planning and some really smart storage controllers. In addition, with a little luck and some persistence, I’ll also pull up a far more modest lab environment and see exactly how much you can wring out of a NetApp controller on a tight budget.












