Check out my latest blog post at http://hubs.ly/H02PJ4r0
I have been watching the multi-site distributed NAS space for some years now. There have been some interesting products including Netapp’s Flexcache which looked nice but never really seemed to get market traction, and similarly IBM Global Active Cloud Engine (Panache) which was released as a feature of SONAS and Storwize V7000 Unified. Microsoft have played on the edge of this field more successfully with DFS Replication although that does not handle locking. Other technologies that encroach on this space are Microsoft Sharepoint and also WAN acceleration technologies like Microsoft Branchcache and Riverbed.
What none of these have been very good at however is solving the problem of distributed collaborative authoring of large complex multi-layered documents with high performance and sturdy locking. For example cross-referenced CAD drawings.
It’s no surprise that the founders of Panzura came from a networking background (Aruba, Alteon) since the issues to be solved are those that are introduced by the network. Panzura is a global file system tuned for CAD files and it’s not unusual to see Panzura sites experience file load times less than one tenth or sometimes even one hundredth of what they were prior to Panzura being deployed.
Rather than just provide efficient file locking however, Panzura has taken the concept to the Cloud, so that while caching appliances can be deployed to each work site, the main data repository can be in Amazon S3 or Azure for example. Panzura now claims to be the only proven global file locking solution that solves cross-site collaboration issues of applications like Revit, AutoCAD, Civil3D, and Bentley MicroStation as well as SOLIDWORKS CAD and Siemens NX PLM applications. The problems of collaboration in these environments are well-known to CAD users.
Panzura has been growing rapidly, with 400% revenue growth in 2013 and they have just come off another record quarter and a record year for 2014. Back in 2013 they decided to focus their energies on the Architectural, Engineering & Construction (so-called AEC) markets since that was where the technology delivered the greatest return on customer investment. In that space they have been growing more than 1000% per year.
ViFX recently successfully supplied Panzura to an international engineering company based in New Zealand. If you have problems with shared CAD file locking, please contact ViFX to see how we can solve the problem using Panzura.
I recall Tom West (Chief Scientist at Data General, and star of Soul of a New Machine) once saying to me when he visited New Zealand that there was an old saying “Hardware lasts three years, Operating Systems last 20 years, but applications can go on forever.”
Over the years I have known many application developers and several development managers, and one thing that they seem to agree on is that it is almost impossible to maintain good code structure inside an app over a period of many years. The pressures of deadlines for features, changes in market, fashion and the way people use applications, the occasional weak programmer, and the occasional weak dev manager, or temporary lapse in discipline due to other pressures all contribute to fragmentation over time. It is generally by this slow attrition that apps end up being full of structural compromises and the occasional corner that is complete spaghetti.
I am sure there are exceptions, and there can be periodic rebuilds that improve things, but rebuilds are expensive.
If I think about the OS layer, I recall Data General rebuilding much of their DG/UX UNIX kernel to make it more structured because they considered the System V code to be pretty loose. Similarly IBM rebuilt UNIX into a more structured AIX kernel around the same time, and Digital UNIX (OSF/1) was also a rebuild based on Mach. Ironically HPUX eventually won out over Digital UNIX after the merger, with HPUX rumoured to be the much less structured product, a choice that I’m told has slowed a lot of ongoing development. Microsoft rebuilt Windows as NT and Apple rebuilt Mac OS to base it on the Mach kernel.
So where am I heading with this?
Well I have discussed this topic with a couple of people in recent times in relation to storage operating systems. If I line up some storage OS’s and their approximate date of original release you’ll see what I mean:
|Netapp Data ONTAP||1992||22 years|
|EMC VNX / CLARiiON||1993||21 years|
|IBM DS8000 (assuming ESS code base)||1999||15 years|
|HP 3PAR||2002||12 years|
|IBM Storwize||2003||11 years|
|IBM XIV / Nextra||2006||8 years|
|Nimble Storage||2010||4 years|
I’m not trying to suggest that this is a line-up in reverse order of quality, and no doubt some vendors might claim rebuilds or superb structural discipline, but knowing what I know about software development, the age of the original code is certainly a point of interest.
With the current market disruption in storage, cost pressures are bound to take their toll on development quality, and the problem is amplified if vendors try to save money by out-sourcing development to non-integrated teams in low-cost countries (e.g. build your GUI in Romania, or your iSCSI module in India).
WAN optimization is not something that storage vendors traditionally put into their storage controllers. Storage replication traffic has to fend for itself out in the WAN world, and replication performance will usually suffer unless there are specific WAN optimization devices installed in the network.
For example, Netapp recommends Cisco WAAS as:
“an application acceleration and WAN optimization solution that allows storage managers to dramatically improve NetApp SnapMirror performance over the WAN.”
“…the rated throughput of high-bandwidth links cannot be fully utilized due to TCP behavior under conditions of high latency and high packet loss.”
EMC similarly endorses a range of WAN optimization products including those from Riverbed and Silver Peak.
Back in July, an IBM redpaper entitled “IBM Storwize V7000 and SANSlide Implementation” slipped quietly onto the IBM redbooks site. The redpaper tells us that:
“this combination of SANSlide and the Storwize V7000 system provides a powerful solution for clients who require efficient, IP-based replication over long distances.“
Bridegworks SANSlide provides WAN optimization, delivering much higher throughput on medium to high latency IP networks. This graph is from the redpaper:
Bridgeworks also advises that:
“On the commercial front the company is expanding its presence with OEM partners and building a network of distributors and value-added partners both in its home market and around the world.“
Anyone interested in replication using any of the Storwize family (including SVC) should probably check out the redpaper, even if only as a little background reading.
IBM has just published an SPC-1 benchmark result for XIV. The magic number is 180,020 low latency IOPS in a single rack. This part of my blog post was delayed by my waiting for the official SPC-1 published document so I could focus in on an aspect of SPC-1 that I find particularly interesting.
XIV has always been a work horse rather than a race horse, being fast enough, and beating other traditional systems by never going out of tune, but 180,020 is still a lot of IOPS in a single rack.
SPC-1 has been criticised occasionally as being a drive-centric benchmark, but it’s actually more true to observe that many modern disk systems are drive-centric (XIV is obviously not one of those). Things do change and there was a time in the early 2000’s when, as I recall, most disk systems were controller-bound, and as systems continue to evolve I would expect SPC-1 to continue to expose some architectural quirks, and some vendors will continue to avoid SPC-1 so that their quirks are not exposed.
For example, as some vendors try to scale their architectures, keeping latency low becomes a challenge, and SPC-1 reports give us a lot more detail than just the topline IOPS number if we care to look.
The SPC-1 rules allow average response times up to 30 milliseconds, but generally I would plan real-world solutions around an upper limit of 10 milliseconds average, and for tier1 systems you might sometimes even want to design for 5 milliseconds.
I find read latency interesting because not only does SPC-1 allow for a lot of variance, but different architectures do seem to give very different results. Write latency on the other hand seems to stay universally low right up until the end. Let’s use the SPC-1 reports to look at how some of these systems stack up to my 5 millisecond average read latency test:
DS8870 – this is my baseline as a low-latency, high-performance system
- 1,536 x 15KRPM drives RAID10 in four frames
- 451,000 SPC-1 IOPS
- Read latency hits 4.27 milliseconds at 361,000 IOPS
HP 3PAR V800
- 1,920 x 15KRPM drives RAID10 in seven frames [sorry for reporting this initially as 3,840 – I was counting the drives and also the drive support package for the same number of drives]
- 450,000 SPC-1 IOPS
- Average read latency hits 4.23 millsconds at only 45,000 IOPS
Pausing for a moment to compare DS8870 with 3PAR V800 you’d have to say DS8870 is clearly in a different class when it comes to read latency.
- 1,152 x 15KRPM drives RAID10 in four frames
- 270,000 SPC-1 IOPS
- Average read latency hits 3.76 ms at only 27,000 IOPS and is well above 5 ms at 135,000
- 608 x 15KRPM drives RAID10 in two frames
- 181,000 SPC-1 IOPS
- Average read latency hits 3.72 ms at only 91,000 IOPS and is above 5 ms at 145,000
- 2 x 512GB Flash Cache
- 120 x 15KRPM drives RAID-DP in a single frame
- 68,034 SPC-1 IOPS
- Average read latency hits 2.73 ms at 34,000 IOPS and is well over 6 ms at 54,000
So how does XIV stack up?
- 15 x 400GB Flash Cache
- 180 x 7200RPM drives RAID-X in a single frame
- 180,020 SPC-1 IOPS
- Average read latency hits 4.08 millseconds at 144,000 IOPS
And while I know that there are many ways to analyse and measure the value of things, it is interesting that the two large IBM disk systems seem to be the only ones that can keep read latency down below 5 ms when they are heavily loaded.
[SPC-1 capacity data removed on 130612 as it wasn’t adding anything, just clutter]
Update 130617: I have just found another comment from HP in my spam filter, pointing out that the DS8870 had 1,536 drives not 1,296. I will have to remember not to write in a such a rush next time. This post was really just an add-on to the more important first half of the post on the new XIV features, and was intended to celebrate the long-awaited SPC-1 result from the XIV team.
This post started life earlier this year as a post on the death of RAID-5 being signaled by the arrival of 3TB drives. The point being that you can’t afford to be exposed to a second drive failure for 2 or 3 whole days especially given the stress those drives are under during that rebuild period.
But the more I thought about RAID rebuild times the more I realized how little I actually knew about it and how little most other people know about it. I realized that what I knew was based a little too much on snippets of data, unreliable sources and too many assumptions and extrapolations. Everybody thinks they know something about disk rebuilds, but most people don’t really know much about it at all and thinking you know something is worse than knowing you don’t.
In reading this so far it started to remind me of an old Abbot and Costello sketch.
Anyway you’d think that the folks who should know the real answers might be operational IT staff who watch rebuilds nervously to make sure their systems stay up, and maybe vendor lab staff who you would think might get the time and resources to test these things, but I have found it surprisingly hard to find any systematic information.
I plan to add to this post as information comes to hand (new content in green) but let’s examine what I have been able to find so far:
1. The IBM N Series MS Exchange 2007 best practices whitepaper mentions a RAID-DP (RAID6) rebuild of a 146GB 15KRPM drive in a 14+2 array taking 90 minutes (best case).
Netapp points out that there are many variables to consider, including the setting of raid.reconstruct.perf_impact at either low, medium or high, and they warn that a single reconstruction effectively doubles the I/O occurring on the stack/loop, which becomes a problem when the baseline workload is more than 50%.
Netapp also says that rebuild times of 10-15 hours are normal for 500GB drives, and 10-30 hours for 1TB drives.
2. The IBM DS5000 Redpiece “Considerations for RAID-6 Availability and Format/Rebuild Performance on the DS5000” shows the following results for array rebuild times on 300GB drives as the arrays get bigger:
I’m not sure how we project this onto larger drive sizes without more lab data. In these two examples there was little difference between N Series 14+2 146GB and DS5000 14+2 300GB, but common belief is that rebuild times rise proportionally to drive size. The 2008 Hitachi whitepaper “Why Growing Businesses Need RAID 6 Storage” however, mentions a minimum of 24 hours for a rebuild of an array with just 11 x 1TB drives in it on an otherwise idle disk system.
What both IBM and Netapp seem to advise is that rebuild time is fairly flat until you get above 16 drives, although Netapp seems to be increasingly comfortable with larger RAID sets as well.
3. A 2008 post from Tony Pearson suggests that “In a typical RAID environment, say 7+P RAID-5, you might have to read 7 drives to rebuild one drive, and in the case of a 14+2 RAID-6, reading 15 drives to rebuild one drive. It turns out the performance bottleneck is the one drive to write, and today’s systems can rebuild faster Fibre Channel (FC) drives at about 50-55 MB/sec, and slower ATA disk at around 40-42 MB/sec. At these rates, a 750GB SATA rebuild would take at least 5 hours.”
Extrapolating from that would suggest that a RAID5 1TB rebuild is going to take at least 9 hours, 2TB 18 hours, and 3TB 27 hours. The Hitachi whitepaper figure seems to be a high outlier, perhaps dependent on something specific to the Hitachi USP architecture.
Tony does point out that his explanation is a deliberate over-simplification for the purposes of accessibility, perhaps that’s why it doesn’t explain why there might be step increases in drive rebuild times at 8 and 16 drives.
4. The IBM DS8000 Performance Monitoring and Tuning redbook states “RAID 6 rebuild times are close to RAID 5 rebuild times (for the same size disk drive modules (DDMs)), because rebuild times are primarily limited by the achievable write throughput to the spare disk during data reconstruction.” and also “For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed time, although RAID 5 and RAID 6 require significantly more disk operations and therefore are more likely to impact other disk activity on the same disk array.”
The below image just came to hand. It shows how the new predictive rebuilds feature on DS8000 can reduce rebuild times. Netapp do a similar thing I believe. Interesting that it does show a much higher rebuild rate than the 50MB/sec that is usually talked about.
5. The EMC whitepaper “The Effect of Priorities on LUN Management Operations” focuses on the effect of assigned priority as one would expect, but is nonetheless very useful in helping to understanding generic rebuild times (although it does contain a strange assertion that SATA drives rebuild faster than 10KRPM drives, which I assume must be a tranposition error). Anyway, the doc broadly reinforces the data from IBM and Netapp, including this table.
This seems to show that increase in rebuild times is more linear as the RAID sets get bigger, as compared to IBM’s data which showed steps at 8 and 16. One person with CX4 experience reported to me that you’d be lucky to get close to 30MB/sec on a RAID5 rebuild on a typical working system and when a vault drive is rebuilding with priority set to ASAP not much else gets done on the system at all. It remains unclear to me how much of the vendor variation I am seeing is due to reporting differences and detail levels versus architectural differences.
6. IBM SONAS 1.3 reports a rebuild time of only 9.8 hours for a 3TB drive RAID6 8+2 on an idle system, and 6.1 hours on a 2TB drive (down from 12 hours in SONAS 1.2). This change from 12 hours down to 6.1 comes simply from a code update, so I guess this highlights that not all constraints on rebuild are physical or vendor-generic.
7. March 2012: I just found this pic from the IBM Advanced Technical Skills team in the US. This gives me the clearest measure yet of rebuild times on IBM’s Storwize V7000. Immediately obvious is that the Nearline drive rebuild times stretch out a lot when the target rebuild rate is limited so as to reduce host I/O impact, but the SAS and SSD drive rebuild times are pretty impressive. The table also came with an comment estimating that 600GB SAS drives would take twice the rebuild time of the 300GB SAS drives shown.
In 2006 Hu Yoshida posted that “it is time to replace 20 year old RAID architectures with something that does not impact I/O as much as it does today with our larger capacity disks. This is a challenge for our developers and researchers in Hitachi.”
I haven’t seen any sign of that from Hitachi, but IBM’s XIV RAID-X system is perhaps the kind of thing he was contemplating. RAID-X achieves re-protection rates of more than 1TB of actual data per hour and there is no real reason why other disk systems couldn’t implement the scattered RAID-X approach that XIV uses to bring a large number of drives into play on data rebuilds, where protection is about making another copy of data blocks as quickly as possible, not about drive substitution.
So that’s about as much as I know about RAID rebuilds. Please feel free to send me your own rebuild experiences and measurements if you have any.
SAN Volume Controller
Late in 2010, Netapp quietly announced they were not planning to support V Series (and by extension IBM N Series NAS Gateways) to be used with any recent version of IBM’s SAN Volume Controller.
This was discussed more fully on the Netapp communities forum (you’ll need to create a login) and the reason given was insufficient sales revenue to justify on-going support.
This is to some extent generically true for all N Series NAS gateways. For example, if all you need is basic CIFS access to your disk storage, most of the spend still goes on the disk and the SVC licensing, not on the N Series gateway. This is partly a result of the way Netapp prices their systems – the package of the head units and base software (including the first protocol) is relatively cheap, while the drives and optional software features are relatively expensive.
Netapp however did not withdraw support for V Series NAS gateways on XIV or DS8000, and nor do they seem to have any intention to, as best I can tell, considering that support to be core capability for V Series NAS Gateways.
I also note that Netapp occasionally tries to position V Series gateways as a kind of SVC-lite, to virtualize other disk systems for block I/O access.
Anyway, it was interesting that what IBM announced was a little different to what Netapp announced “NetApp & N Series Gateway support is available with SVC 6.2.x for selected configurations via RPQ [case-by-case lab approval] only”
What made this all a bit trickier was IBM’s announcement of the Storwize V7000 as its new premier midrange disk system.
Soon after on the Netapp communities forum it was stated that there was a “joint decision” between Netapp and IBM that there would be no V Series NAS gateway support and no PVRs [Netapp one-off lab support] for Storwize V7000 either.
Now the Storwize V7000 disk system, which is projected to have sold close to 5,000 systems in its first 12 months, shares the same code-base and features as SVC (including the ability to virtualize other disk systems). So think about that for a moment, that’s two products and only one set of testing and interface support – that sounds like the support ROI just improved, so maybe you’d think that the original ROI objection might have faded away at this point? It appears not.
Anyway, once again, what IBM announced was a little different to the Netapp statement “NetApp & N Series Gateway support is available with IBM Storwize V7000 6.2.x for selected configurations via RPQ only“.
Whither from here?
The good news is that IBM’s SONAS gateways support XIV
and SVC (and other storage behind SVC) and SONAS delivers some great features that N Series doesn’t have (such as file-based ILM to disk or tape tiers) so SVC is pretty well catered for when it comes to NAS gateway funtionality.
When it comes to Storwize V7000 the solution is a bit trickier. SONAS is a scale-out system designed to cater for 100’s of TBs up to 14 PBs. That’s not an ideal fit for the midrange Storwize V7000 market. So the Netapp gateway/V-series announcement has created potential difficulties for IBM’s midrange NAS gateway portfolio… hence the title of this blog post.