XIV Drive Management

One issue that urgently needs accuracy and clarity is the disk management technology behind XIV.

First some background on XIV:

XIV has a unique architecture: it is a grid of 15 nodes, each with its own 8GB cache and 12 drives. In fact XIV is a distributed cache grid, and as Moshe says, don’t get hung up on the drives, they are just containers. If you picture the modern concept of a tiered storage system that uses SSDs to automatically house hot segments (e.g. extents, sub-LUNs, blocks) and SATA to hold data that isn’t hot, then you are starting to get a sense of how XIV gets its performance. It doesn’t use SSDs; it uses RAM instead.
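
The placement details are IBM’s, but purely as a toy illustration of the general idea, here is a minimal Python sketch of pseudo-random placement of small mirrored chunks across a grid of modules. The placement logic, the seed, and the 1MB chunk granularity are my own assumptions for illustration, not XIV’s actual algorithm.

```python
import random

MODULES = 15            # grid modules, each with its own cache and 12 drives
DRIVES_PER_MODULE = 12
CHUNK_MB = 1            # distribution granularity assumed for this illustration

def place_chunks(volume_gb, seed=42):
    """Toy placement: every 1MB chunk gets a primary and a mirror copy on
    two different modules chosen pseudo-randomly. Not XIV's real algorithm."""
    rng = random.Random(seed)
    placement = []
    for _ in range(volume_gb * 1024 // CHUNK_MB):
        primary = (rng.randrange(MODULES), rng.randrange(DRIVES_PER_MODULE))
        mirror = primary
        while mirror[0] == primary[0]:       # mirror must sit on another module
            mirror = (rng.randrange(MODULES), rng.randrange(DRIVES_PER_MODULE))
        placement.append((primary, mirror))
    return placement

# Even a modest 200GB volume ends up touching essentially every drive:
chunks = place_chunks(200)
drives_touched = {loc for pair in chunks for loc in pair}
print(f"{len(chunks)} chunks spread over {len(drives_touched)} of "
      f"{MODULES * DRIVES_PER_MODULE} drives")
```

The point of the sketch is simply that every volume, and every drive’s contents, ends up spread thinly across the whole grid, which is what makes the behaviour described below possible.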

The concept of cache is not new; Moshe is often credited with figuring out how to make large disk systems benefit from cache back when he was at EMC in the early 90s. As it happens, I was then at Data General, where we had a new product called HADA (High Availability Disk Array) which won awards in 1991 and was later enhanced and renamed CLARiiON around ’94 (Data General was purchased by EMC in August ’99, and I spent four months as an EMC employee before moving to IBM).

The problem with cache in recent years is that big centralized caches provide diminishing performance returns: smart management of a big shared cache carries a lot of system overhead. XIV sidesteps this by using 15 x 8GB distributed caches, each cache driven by 4 Intel cores and servicing 12 drives.

Unplanned outages are not that common on traditional disk systems, but when they do occur, most seem to me to be caused by one HA component taking out another. Let’s consider two examples:

  1. A drive fails in an 8-drive RAID set. A hot spare (which may not have been exercised for some time) kicks in, and the stress of eight hours of intensive rebuild I/O causes a second drive failure, or sometimes a controller failure or hang.
  2. A controller fails and somehow takes down the other controller on a dual-controller disk system. A colleague of mine heard from a customer who was incensed when a vendor employee told him this was impossible and that the two controllers failing at the same time was just a coincidence.

By comparison, let us consider the issues involved in XIV’s approach to protecting data in the event of drive or controller failure:

What if drives almost never failed?

How can this be, given that enterprise SATA has only 75% of the MTBF of enterprise FC/SAS? The answer depends on what we mean by ‘fail’. I will distinguish between a soft-fail (an administrative action taken by the XIV to ensure data protection) and a hard-fail (when a drive breaks and forces the XIV to react). What the XIV architecture seeks to do is all but eliminate the hard fail.

XIV uses background scrubbing to detect potential drive block media errors, and it uses a relatively small number of media errors as the trigger for a soft-fail of that drive. Other vendors, if they use scrubbing at all, generally wait for a much higher (perhaps 10 times higher) number of media errors before they trigger a rebuild, so they are less proactive. Another difference is that an administrative fail by XIV allows the data to be copied off the suspect drive (effectively creating a third copy of the data), whereas a traditional array rebuild does not make use of the drive in question but re-creates the data from parity information. XIV’s approach is much faster.
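
The actual error thresholds are not public. Purely as a sketch of the decision logic described above, with made-up numbers, the contrast between a proactive soft-fail and a reactive hard-fail might look like this:

```python
# Illustrative thresholds only; the real values are internal to each vendor.
XIV_MEDIA_ERROR_THRESHOLD = 10       # assumed: soft-fail early, while data is readable
TRADITIONAL_ERROR_THRESHOLD = 100    # assumed: roughly 10x higher before acting

def scrub_decision(media_errors, drive_still_readable=True):
    """Return the action a background scrub pass might trigger (toy logic)."""
    if drive_still_readable and media_errors >= XIV_MEDIA_ERROR_THRESHOLD:
        # Proactive: copy data off the suspect drive while it still works,
        # so a third copy effectively exists during the migration.
        return "soft-fail: copy data off the suspect drive"
    if not drive_still_readable or media_errors >= TRADITIONAL_ERROR_THRESHOLD:
        # Reactive: the drive is no longer usable, reconstruct from redundancy.
        return "hard-fail: rebuild from the surviving copy/parity"
    return "keep scrubbing"

print(scrub_decision(media_errors=12))   # takes the proactive soft-fail path
```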

XIV also uses S.M.A.R.T. to monitor temperature, spin-up time, response time and so on. SMART data on an XIV system is very linear because there are no hotspots, which allows good decision-making about drive health. In general, SMART is not so smart: SMART analysis is all about trend analysis, and when a drive sees unpredictable I/O loads, that analysis struggles to reach useful conclusions about drive health. One of the problems that plagues other vendors is false positives caused by erratic drive loads and hotspots, which leads them to ignore some SMART indications or set higher thresholds, undermining the whole value of SMART. This is a widely acknowledged industry problem. Because XIV is so steady in its workload distribution, it is far less affected by this trend-analysis problem or by false positives.
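
To see why a steady workload makes SMART trending usable, consider a minimal sketch with hypothetical samples (not real SMART attribute values): the same simple trend calculation gives a trustworthy answer on a uniform load and a noisy, misleading one on a hotspot-driven load.

```python
def linear_trend(samples):
    """Least-squares slope plus a crude noise measure (mean absolute residual)."""
    n = len(samples)
    xs = list(range(n))
    mx, my = sum(xs) / n, sum(samples) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, samples))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    noise = sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, samples)) / n
    return slope, noise

# Hypothetical drive response-time samples (ms), taken hourly:
steady  = [5.0, 5.1, 5.1, 5.2, 5.3, 5.4, 5.4, 5.5]    # uniform load: clean trend
erratic = [5.0, 9.8, 4.9, 11.2, 5.1, 10.5, 5.0, 12.0]  # hotspots: noise swamps the trend

for name, series in (("steady", steady), ("erratic", erratic)):
    slope, noise = linear_trend(series)
    print(f"{name}: slope={slope:.2f} ms/hour, noise={noise:.2f} ms")
```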

What if re-protecting after a failure took almost no time?

Re-protection time for the data on a pre-failed drive is typically 20-30 minutes, as contrasted with several hours (sometimes days) on a typical RAID5/RAID6 FC RAID set on a traditional disk system.
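
A back-of-the-envelope calculation, using assumed (not measured) throughput figures, shows why spreading the copy across the whole grid collapses the elapsed time:

```python
# Back-of-the-envelope only: the data sizes and rates below are assumptions
# chosen for illustration, not XIV measurements.
DATA_GB     = 500    # allocated data on a 1TB pre-failed drive needing a new copy
SPARE_MB_S  = 35     # rate a lone hot spare might sustain while host I/O continues
GRID_DRIVES = 168    # 14 other modules x 12 drives sharing the copy work
GRID_MB_S   = 2.5    # assumed per-drive rate, heavily throttled to spare host I/O

spare_hours  = DATA_GB * 1024 / SPARE_MB_S / 3600
grid_minutes = DATA_GB * 1024 / (GRID_DRIVES * GRID_MB_S) / 60

print(f"rebuild onto a single hot spare:  ~{spare_hours:.1f} hours")
print(f"re-protect across the whole grid: ~{grid_minutes:.0f} minutes")
```

Even with the grid copy throttled to a gentle per-drive rate, the sheer number of drives sharing the work brings the elapsed time down from hours to minutes.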

What if the system could sustain multiple drive failures within a 30 minute window?

We know that XIV aggressively pre-fails drives so most re-protection incidents are the result of an XIV-initiated soft-fail. Sudden hard fails do exist, but they are relatively rare.

XIV multi-drive failure resilience works like this. If a second drive shows pre-fail signs during that 20-30 minute window, XIV simply ignores it until it has finished dealing with the first drive; XIV can cope with multiple drives pre-failing simultaneously in this way. If the second drive is not a pre-fail but a sudden hard fail, XIV puts the re-protection of the soft-failed drive (the creation of third copies of data) on hold, deals with the hard-failed drive, and then goes back and finishes protecting the data on the soft-failed drive. Again, XIV can cope with many drives failing simultaneously as long as only one of them is a sudden hard fail, and even sudden hard-fail drives can sometimes be restarted. Hitachi report that 80% of failed drives returned from the field are “No Defect Found”, and many of the other 20% were physically damaged by the end user (e.g. dropped). XIV support can dial in and spin up a drive if that should ever be required, but remember that there are a couple of thousand XIVs out there and none has ever experienced a double disk failure.
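
Expressed as a toy scheduling rule (my own sketch, not IBM’s code), the prioritisation above amounts to: a sudden hard fail always jumps the queue, and the soft-fail copy-offs resume once it has been dealt with.

```python
from collections import deque

def handle_failures(events):
    """Toy scheduler for the priority rule described above: sudden hard fails
    are handled first; soft-fail copy-offs queue up and resume afterwards."""
    soft_queue = deque()
    actions = []
    for drive, kind in events:
        if kind == "hard":
            actions.append(f"{drive}: rebuild its data now from the surviving copies")
        else:  # "soft" / pre-fail: the drive is still readable, so no urgency
            soft_queue.append(drive)
    while soft_queue:
        drive = soft_queue.popleft()
        actions.append(f"{drive}: copy data off the pre-failed drive "
                       "(a third copy exists during the move)")
    return actions

# One sudden hard fail arrives while two other drives are being soft-failed:
for action in handle_failures([("drive-A", "soft"),
                               ("drive-B", "hard"),
                               ("drive-C", "soft")]):
    print(action)
```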

Also, two drives failing in the same module, or two drives failing across any of the six I/O modules, will not cause XIV any problem at all. XIV’s intelligent drive management keeps the data protected.

There are many comments out on the blogosphere and in competitive vendor presentations that accuse XIV of not being able to cope with two drives failing. These opinions are generally fairly speculative, but some seem to be deliberately sowing FUD.

What if re-protecting put almost no stress on the system?

Because the data that needs re-protecting is spread across the 14 other modules, each with its own 8GB cache and 12 drives, the XIV reads from and writes to all of those modules at once: every drive and every cache participates in the re-protection workload. Contrast that with a traditional system, where a single 8-drive RAID set is hammered for hours and the rebuild target is a single drive that has been sitting idle for weeks or months. Rebuild stress on traditional systems is a major factor: the MTBF of any drive plummets if it is hammered for hours on end, and this is exactly when traditional systems are vulnerable to double drive failure and other stress-related problems. So if XIV avoids drive hammering, then even its SATA drives will be more reliable than a competing architecture that hammers FC drives during rebuilds.
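
To put rough numbers on that stress concentration (again, assumed figures purely for illustration): a RAID5 rebuild forces every surviving drive in the set to stream its entire contents so the lost drive can be reconstructed from parity, whereas each XIV drive only has to move its own small slice.

```python
# Illustrative only: an assumed drive size and an assumed effective rebuild rate.
DRIVE_GB  = 1000     # capacity of the failed/pre-failed drive
RATE_MB_S = 35       # assumed effective rebuild rate while host I/O continues

# Traditional RAID5 (8-drive set): each surviving drive streams its full
# contents for parity reconstruction, and the spare absorbs the whole image.
raid5_hours_per_drive = DRIVE_GB * 1024 / RATE_MB_S / 3600

# XIV: each of the ~168 drives on the other modules moves only its own slice
# of the (roughly half-allocated) failed drive.
xiv_gb_per_drive      = DRIVE_GB * 0.5 / (14 * 12)
xiv_minutes_per_drive = xiv_gb_per_drive * 1024 / RATE_MB_S / 60

print(f"RAID5: each drive under rebuild load for ~{raid5_hours_per_drive:.1f} hours")
print(f"XIV:   each drive moves ~{xiv_gb_per_drive:.1f} GB, "
      f"roughly {xiv_minutes_per_drive:.1f} minutes of load")
```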

What if there were six completely physically independent controllers?

The XIV has six I/O nodes that talk to each other over multiple GigE connections; there is no shared backplane. When a ‘controller’ (a node on the grid) fails, another five ‘controllers’ are still running, so instead of slowing by 50% as you would on most traditional systems, you only slow by perhaps 16%. I am told a full node re-protection takes 4-5 hours, after which you are running protected again. Compare that to a traditional dual-controller system, where you are running at only 50% performance and completely exposed to a second failure until the engineer turns up with the parts to swap the failed controller. In some cases the XIV can sustain three or more node (‘controller’) failures one after the other (any third or subsequent survival depends on free disk space), although the chances of more than one node failing are obviously very small.
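
The performance arithmetic is simple; here is a minimal sketch, assuming I/O capability is spread evenly across identical nodes:

```python
def remaining_capability(total_nodes, failed_nodes=1):
    """Fraction of aggregate I/O capability left after node failures,
    assuming the load is spread evenly across identical nodes."""
    return (total_nodes - failed_nodes) / total_nodes

print(f"dual-controller array, one controller down: {remaining_capability(2):.0%} left")
print(f"XIV with six interface nodes, one node down: {remaining_capability(6):.0%} left")
```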

What if you’re still worried?

Both sync and async (snapshot-based) replication are included in the XIV base code, i.e. no extra licence fees and no extra support fees: if you own an XIV then you own the replication licences. Replication is supported over both iSCSI and FC, so if you’re really worried about some of your mission-critical data, replicate it, just as you would if you had a DMX or a DS8700.

I hope this helps explain how XIV’s intelligent disk management protects from double disk failure in particular.

Next post I will look at power usage – another area that has been subject to competitor FUD.

Regards, Jim

20 Responses

  1. Jim,

    It’s really a good article and I think it’s very accurate, but you don’t mention a couple of things:

    1) The 1 GbE connection between nodes. If you chop 1TB disks into 17GB “LUNs” and those into 1MB chunks, then almost every disk ends up holding some piece of the original “LUN”. Then you create volumes, say 200GB, so you use lots of chunks, LUNs, disks and nodes, and a LOT of inter-node communication. So you have latency, something you cannot ignore on 1GbE.
    What happens with databases which need low latency?

    2) Then you have the internal bandwidth in a system with no QoS. If one application asks for a lot of data, let’s say sequential reads, I guess you could saturate the internal bandwidth and other applications would starve… I think this is a higher risk than two disks failing.
    I remember one case with a customer using iSeries and IXS. IXS is an Integrated xSeries Server that uses iSeries disks. iSeries is a system with wide striping, and you can create a special ASP (disk pool) for special purposes. This customer created the virtual disk in the main ASP and ran WebSphere with a very I/O-intensive application. The conclusion: when the disks reached 40% or higher utilization, the IXS almost died. It was a very good lesson.

    3) You only talk about the 15-node system. How secure is an 8-node system, the one IBM is pushing hardest?

    4) Why XIV and not a DS5100 with 224 450GB disks? I think the problem with DS5xxx is licensing, but I could add an N6040 gateway to virtualize it (if you want it cheaper, try 4 or 6 DS3400s full of disks) and get an inexpensive array, QoS, regulatory compliance (financial services companies tend to ask for mirroring) and a good price.
    A half-rack XIV (8 nodes) with discounts runs around 250K (it was published in some countries as an Advantage offering, then the price disappeared), and it is not a full XIV: only 27 TB, and not the same redundancy…

    I think XIV is a great product, with some amazing concepts like ease of use, no tiering, self-healing, aggressive pricing, no per-server licensing, replication, etc, but I think it still lacks some features needed to be a high-end system: QoS, low-latency inter-node networking (come on, at least 10 GbE), disk parity protection (for compliance purposes and added security), 10 GbE iSCSI, 8Gb FC.

    I think that, in the meantime, it is a good product for current DAS, VMware, small Unix and Windows customers.

    • Thanks for your comments. You raise a range of issues. The points I would make in response are:
      The 6-node XIV has the same resilience story as the 15-node; the only difference is the number of nodes. Most of the systems sold are 15-node. The feedback I have seen from customers suggests they are seeing very good performance and low-latency response for database apps. The guys who built this are very smart. I did hear about one situation in the US where the system was undersized (to replace a very heavily loaded IBM DS8300), but that was down to an inexperienced architect.

      Your assertion that XIV is good for smaller environments is not in line with what we are seeing in the field. We are seeing DMX and high-end CLARiiON being replaced by XIV and XIV delivering better performance. For anyone serious about evaluating XIV, it’s imperative that they talk to existing XIV customers. They know better than anyone what XIV can do.

      • I believe you, and I’m starting to guess that you use FC ports on different nodes to balance I/O. So imagine a customer interested in XIV: how do you explain that he can put mixed workloads on the same system? How do you explain that he needs XIV instead of a DS5100? Is it a matter of price?

      • It’s not really accurate to say “the only difference is the number of nodes.”

        Many things change in a smaller system…too many to post in this comment…

      • SRJ – you have some space – tell me what changes. What I am trying to do with this blog is to have all opinions backed by some reasoning or evidence.

        • Typing from my phone right now, so I’ll leave it at one example just to support my claim.

          Rebuilds – in a 27TB system there are far fewer drives to participate in the redistribution process, so rebuilds definitely take longer than on a full system.

  2. Sorry, I pressed submit before finishing. Is it a matter of having enough I/O that you can risk mixing workloads without QoS? Is it a matter of size?
    I understand this is the future for IBM, their own multi-purpose storage system, and I applaud the idea, but there is something I’m worried about:
    IBM Storage sales specialists: they sell, no matter whether it fits the customer, and there is not enough commercial information about XIV. I’ve even seen XIV sold over “almost closed” DS5xxx deals, and I still think not every customer is a fit for XIV.

    • So if the question is about positioning XIV and DS5000, the short answer is that XIV is positioned in the enterprise space – very high function and no time for downtime (e.g. no outage required for drive or controller firmware upgrades). DS5000 is positioned in the traditional midrange price/performance space. Personally I would rather buy XIV, but it’s up to each customer to choose what fits him best.

  3. Two drives failed and you can still survive? Only if you get to pick them.

    “To lose one disk, Mr Keller, may be regarded as a recoverable misfortune; to suggest that the XIV can lose any two and survive looks like carelessness.”

  4. Read my post carefully and you will understand that what you say is not true. Bring your evidence, not your FUD.

    • Have you checked out the XIV Google Wave? What Alex says here is imprecise and not 100% clear, but what he is basically saying is, in fact, true:

      Certain two-drive failures will cause a pretty big problem.

      • I’m trying to avoid the imprecision, especially when it is used mischievously.

        • Cheers to that!!!

        • Mischief is not what I’m about.

          Let’s try a bit of precision with language.

          Fail = not working
          Pre-fail, soft fail = working, but calculated to be at risk

          The XIV cannot tolerate any two disks failing. Selected disks, yes; it can tolerate certain combinations of failures (including the “pre-fail” category), but *at best* the risk profile of the XIV is the equivalent of running a leading vendor SAN at RAID-5 with 2 RAID groups of 1 parity drive and (on average) 78 data drives.

          It doesn’t matter how you spin it. SMART, checksums, scrubbing, pre-fail; the leading array brand names do all of that and often do more.

          Then NetApp in particular recognise one further risk that the XIV designers seem to have ignored.

          Near-simultaneous double disk failures happen, and happen more frequently than you think. You don’t get to choose the disks that fail, either.

          That’s why we run dual-parity RAID. The risk is not reduced to zero, even with dual parity, but let me assure you and anyone else that’s reading: it is *several thousand* times less likely to lose all your data than the XIV scheme.

          If you want the math, I can do it for you. It’s not difficult, and it might bring home to you that guessing & repeating urban storage myths rather than researching your facts makes it much harder to defend your position.

            • Hey, I work for IBM as a storage architect, and I have been known to deliver detailed presentations in public fora on the mechanisms in ONTAP for error handling (based on an excellent paper by Rajesh Sundaram, “The Private Lives of Disk Drives – How NetApp Protects Against 5 Dirty Secrets”; it’s worth a read if you don’t know it already). Personally, looking to the future, I regard RAID6 as a band-aid for RAID5, something that will be history in 10 years’ time.

              When you say ‘the brand names do all of that’ you skip over the parts of my post where I explain why XIV is different with respect to these things. I am starting to wonder how many people actually read my post right through with attention.

              • The XIV is different how exactly? Look, a dead drive and an ECC error in any of 90 drives or so will kill it. Period. That’s certainly different.

              But what about the rest? High-end arrays do ALL the things you claim as unique to XIV for protecting data, and then some more, like half-decent RAID schemes.

              So what did I miss?

  5. Agreed: two disk failures, each in a specific node, during a 30-minute window is about as likely as a two-controller failure. And that is without taking into account that human operations have a higher failure rate than hardware.
    I know people with monolithic systems like iSeries using about 120 disks, no controller redundancy, just RAID5, and with a good service contract they NEVER fail. And you cannot say iSeries/AS/400/IBM i is a bad system or that you can lose your data using these systems.
    It’s a solution, not just spare parts like disks, cables and controllers…

  6. […] contributes to the blogging community are well accepted but, in a hard fighting arena like this, one of his first posts were a little bit ingenuous (IMHO)… and posts from other bloggers arrived […]

  7. Jim, great job. Yes, IBM XIV is very resilient against double drive failure. Although a DDF has yet to cause any customer to lose data, if it were ever to happen, only a few GB of data would become inaccessible, and the files on the affected LUNs could be identified and recovered in less time than a RAID5 rebuild takes. See my post here:

    https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ddf-debunked-xiv-two-years-later?lang=en_us

    Tony Pearson (IBM)

  8. […] his post [XIV drive management], fellow blogger Jim Kelly (IBM) covers a variety of reasons why storage admins feel double drive […]
