One issue that urgently needs accuracy and clarity is the disk management technology behind XIV.
First some background on XIV:
XIV has a unique architecture: it is a grid of 15 nodes, each with its own 8GB cache and 12 drives. XIV is really a distributed cache grid, and as Moshe says, don't get hung up on the drives, they are just containers. If you picture the modern concept of a tiered storage system that uses SSDs to automatically house hot segments (e.g. extents, sub-LUNs, blocks) and SATA to hold data that isn't hot, then you are starting to understand how XIV gets its performance. It just doesn't use SSDs; it uses RAM instead.
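To make the grid idea concrete, here is a minimal sketch of how a grid can pseudo-randomly spread small partitions over every module, with the two copies of each partition always landing on different modules so that no single module failure removes both. The hash scheme and constants are my illustration; XIV's actual placement algorithm is proprietary.

```python
# Illustrative sketch only: XIV's real placement algorithm is proprietary.
# Each small partition gets two copies on two *different* modules, spread
# pseudo-randomly over all 15 x 12 = 180 drives, so there are no hotspots.
import hashlib

MODULES = 15
DRIVES_PER_MODULE = 12

def place_partition(volume_id: int, partition_no: int):
    """Return (module, drive) for the primary and the mirror copy."""
    h = int(hashlib.sha256(f"{volume_id}:{partition_no}".encode()).hexdigest(), 16)
    primary_module = h % MODULES
    # Choose the mirror from the other 14 modules, never the primary's.
    mirror_module = (primary_module + 1 + (h // MODULES) % (MODULES - 1)) % MODULES
    primary = (primary_module, (h >> 16) % DRIVES_PER_MODULE)
    mirror = (mirror_module, (h >> 24) % DRIVES_PER_MODULE)
    return primary, mirror

print(place_partition(volume_id=42, partition_no=7))
```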
The concept of cache is not new; Moshe is often credited with figuring out how to make large disk systems benefit from cache back at EMC in the early 90s. As it happens, I was then at Data General, where we had a new product called HADA (High Availability Disk Array) which won awards in 1991 and was later enhanced and renamed CLARiiON around 1994 (Data General was purchased by EMC in August 1999, and I spent 4 months as an EMC employee before moving to IBM).
The problem with cache in recent years is that big centralized caches provide diminishing performance returns: smart management of a big shared cache carries a lot of system overhead. XIV sidesteps this by using 15 x 8GB distributed caches, each driven by 4 Intel cores and servicing 12 drives.
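As a rough sketch of why the distributed design avoids that overhead (my illustration, not XIV code): each module's cache holds only data from its own 12 drives, so there is no shared cache directory to keep coherent and no global lock for 15 modules to contend on.

```python
# Sketch of a per-module cache: every lookup is a purely local decision.
# Managing 8GB over 12 drives with 4 cores is a small, easy problem;
# managing one shared 120GB cache over 180 drives is a much harder one.
class Module:
    def __init__(self, module_id: int):
        self.module_id = module_id
        self.cache: dict[int, bytes] = {}   # private cache, local data only

    def read(self, partition: int, local_drives: dict[int, bytes]) -> bytes:
        if partition not in self.cache:
            # Cache miss: fetch from this module's own drives. No other
            # module's cache needs to be consulted or invalidated.
            self.cache[partition] = local_drives[partition]
        return self.cache[partition]

grid = [Module(i) for i in range(15)]
print(grid[3].read(7, {7: b"hot data"}))   # served entirely by module 3
```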
Unplanned outages are not that common on traditional disk systems, but when they do occur, most seem to me to be caused by one HA component taking out another. Let’s consider two examples:
- A drive fails in an 8-drive RAID set, a hot spare kicks in (one that may not have been exercised for some time), and the stress of 8 hours of intensive rebuild I/O causes a second drive failure, or sometimes a controller failure or hang.
- A controller fails and somehow takes down the other controller in a dual-controller disk system. A colleague of mine spoke to a customer who was incensed when a vendor employee told him this was impossible and that his two controllers failing at the same time was just a coincidence.
By comparison, let us consider the issues involved in XIV’s approach to protecting data in the event of drive or controller failure:
What if drives almost never failed?
How can this be, given that enterprise SATA has only 75% of the MTBF of enterprise FC/SAS? It depends on what we mean by 'fail'. I will distinguish between a soft-fail (an administrative action taken by the XIV to ensure data protection) and a hard-fail (when a drive breaks and forces the XIV to react). What the XIV architecture seeks to do is all but eliminate the hard fail.
XIV uses background scrubbing to detect potential drive block media errors, and it uses a relatively small number of media errors as the trigger for a soft-fail of that drive. Other vendors, if they use scrubbing at all, generally wait for a much higher (perhaps 10 times higher) number of media errors before triggering a rebuild, so they are less proactive. Another difference is that an administrative fail by XIV allows the data to be copied off the suspect drive (effectively creating a third copy of the data), whereas a traditional array rebuild does not use the drive in question at all but re-creates its data from parity information. XIV's approach is much faster.
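A hedged sketch of that scrubbing policy (the thresholds and the Drive type are my assumptions, not published XIV values):

```python
from dataclasses import dataclass, field

SOFT_FAIL_THRESHOLD = 5       # assumed low, proactive XIV-style trigger
TRADITIONAL_THRESHOLD = 50    # assumed ~10x higher traditional trigger

@dataclass
class Drive:
    blocks: list = field(default_factory=lambda: [True] * 1000)  # True = readable
    media_errors: int = 0
    state: str = "healthy"

def scrub_pass(drive: Drive) -> None:
    """Background scan: count unreadable blocks and soft-fail early."""
    drive.media_errors = sum(1 for ok in drive.blocks if not ok)
    if drive.media_errors >= SOFT_FAIL_THRESHOLD:
        # The drive still mostly works, so its data can be copied off
        # (a third copy) instead of being re-created from parity.
        drive.state = "soft-failed"

d = Drive(blocks=[True] * 994 + [False] * 6)
scrub_pass(d)
print(d.state, d.media_errors)                 # soft-failed 6
print(d.media_errors >= TRADITIONAL_THRESHOLD) # False: others wouldn't react yet
```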
XIV also uses S.M.A.R.T. data to monitor temperature, spin-up time, response time and so on. SMART data on an XIV system is very linear because there are no hotspots, which allows good decisions about drive health. In general, SMART is not so smart: SMART analysis is all about trend analysis, and when a drive's I/O load is unpredictable, it really struggles to reach useful conclusions about drive health. One problem that plagues other vendors is false positives caused by erratic drive loads and hotspots, which leads them to ignore some SMART indications or set higher thresholds, undermining the whole value of SMART. This is a widely acknowledged industry problem. Because XIV's workload distribution is so steady, it is largely immune to this trend-analysis problem and its false positives.
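To illustrate the trend-analysis point (my example, not XIV's analysis code): fit a line to a SMART attribute sampled over time. With a steady load the trend is clean and trustworthy; with hotspot-driven noise the same fit tells you very little, which is where the false positives come from.

```python
def trend_slope(samples: list[float]) -> float:
    """Least-squares slope of equally spaced samples of a SMART attribute."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

steady  = [10, 11, 12, 13, 14, 15]   # uniform load: a clean, believable trend
erratic = [10, 40, 5, 60, 8, 55]     # hotspot-driven noise on the same drive

print(trend_slope(steady))    # 1.0: clearly worsening, act on it
print(trend_slope(erratic))   # ~5.3, but the noise makes it meaningless
```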
What if re-protecting after a failure took almost no time?
Re-protection time for the data on a pre-failed drive is typically 20-30 minutes, compared with several hours (sometimes days) for a typical RAID5/RAID6 FC RAID set on a traditional disk system.
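A back-of-envelope check on those numbers (all inputs are my assumptions, chosen to be plausible, not IBM figures): XIV re-copies only the used data of one drive, with roughly 168 surviving drives each moving a small slice of it, while a RAID rebuild is bottlenecked on a single spare.

```python
used_data_mb       = 500_000   # assumed: 1TB drive, half full
xiv_mb_s_per_drive = 3         # assumed gentle background rate per drive
xiv_drives         = 168       # ~14 other modules x 12 drives share the work
spare_mb_s         = 50        # assumed write rate of the one RAID spare

xiv_minutes = used_data_mb / (xiv_mb_s_per_drive * xiv_drives) / 60
raid_hours  = 1_000_000 / spare_mb_s / 3600   # full 1TB rebuild onto one spare

print(f"XIV re-protection: ~{xiv_minutes:.0f} minutes")   # ~17 minutes
print(f"RAID rebuild:      ~{raid_hours:.1f} hours")      # ~5.6 hours
```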
What if the system could sustain multiple drive failures within a 30 minute window?
We know that XIV aggressively pre-fails drives, so most re-protection incidents are the result of an XIV-initiated soft-fail. Sudden hard fails do happen, but they are relatively rare.
XIV's multi-drive failure resilience works like this: if a second drive shows pre-fail signs during the 20-30 minute window, XIV ignores it for the moment, until it has finished dealing with the first drive, and it can cope with multiple drives pre-failing simultaneously in this way. If the second drive is not a pre-fail but a sudden hard-fail, XIV sets aside the re-protection of the soft-failed drive (the creation of third copies of data), deals with the hard-failed drive, and then goes back and finishes protecting the data on the soft-failed drive. Again, XIV can cope with many drives failing simultaneously as long as only one of them is a sudden hard fail, and even hard-failed drives can sometimes be restarted: Hitachi reports that 80% of failed drives returned from the field are "No Defect Found", and many of the other 20% were physically damaged by the end user (e.g. dropped). XIV support can dial in and spin up a drive if that should ever be required, but remember that there are a couple of thousand XIVs out there and none has ever experienced a double disk failure.
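Here is a small sketch of that prioritisation (my reconstruction of the behaviour described above, not XIV source code): a sudden hard-fail always jumps ahead of pending soft-fail re-protection, which then resumes.

```python
import heapq

HARD, SOFT = 0, 1   # lower value = handled first

class ReprotectQueue:
    def __init__(self):
        self._q, self._seq = [], 0   # seq gives FIFO order within a priority

    def drive_failed(self, drive_id: str, hard: bool) -> None:
        heapq.heappush(self._q, (HARD if hard else SOFT, self._seq, drive_id))
        self._seq += 1

    def next_job(self):
        return heapq.heappop(self._q)[2] if self._q else None

q = ReprotectQueue()
q.drive_failed("d1", hard=False)   # soft-fail: being copied off
q.drive_failed("d2", hard=False)   # second soft-fail waits its turn
q.drive_failed("d3", hard=True)    # sudden hard-fail preempts both
print([q.next_job() for _ in range(3)])   # ['d3', 'd1', 'd2']
```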
Also, two drives failing in the same module, or two drives failing across any of the six I/O modules, will not cause XIV any problem at all. XIV's intelligent drive management keeps the data protected.
There are many comments out on the blogosphere and in competitive vendor presentations that accuse XIV of not being able to cope with two drives failing. These opinions are generally fairly speculative, but some seem to be deliberately sowing FUD.
What if re-protecting put almost no stress on the system?
Because the data that needs re-protecting is spread across the 14 other data modules, each with its 8GB distributed cache and 12 drives, XIV reads from and writes to all of those modules at once: every drive and every cache participates in the re-protection workload. Contrast that with a traditional system, where a single 8-drive RAID set is hammered for hours and the rebuild target is one drive that has been sitting idle for weeks or months. Rebuild stress on traditional systems is a major factor: the MTBF of any drive plummets if it is hammered for hours on end, and this is when traditional systems are vulnerable to double drive failure and other stress-related problems. So if XIV avoids drive hammering, even its SATA drives will be more reliable than a competing architecture that hammers FC drives during rebuilds.
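Rough numbers behind the stress comparison (assumed figures, not measurements), expressed as the extra data each surviving drive has to move:

```python
data_to_move_mb = 500_000   # assumed used data on the failed drive

# XIV: ~168 surviving drives each read and write a small slice.
xiv_per_drive_gb = data_to_move_mb / 168 / 1000   # ~3GB each

# 8-drive RAID5: 7 survivors are read end-to-end for hours and one
# long-idle spare absorbs the entire reconstructed write stream.
raid_survivor_gb = 1_000_000 / 1000               # ~1TB read from each survivor
raid_spare_gb    = 1_000_000 / 1000               # ~1TB written to the spare

print(f"XIV per-drive load: ~{xiv_per_drive_gb:.0f} GB")
print(f"RAID survivor load: ~{raid_survivor_gb:.0f} GB each, non-stop")
```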
What if there were six completely physically independent controllers?
The XIV has six I/O nodes that talk to each other over multiple GigE connections; there is no shared backplane. When a 'controller' (a node on the grid) fails, another five 'controllers' are still running, so instead of slowing by 50% as you would on most traditional systems, you slow by perhaps 16%. I am told a full node re-protection takes 4-5 hours, after which you are running protected again. Compare that to a traditional dual-controller system, where you run at only 50% performance, completely exposed to a second failure, until the engineer turns up with the parts to swap the failed controller. In some cases the XIV can sustain three or more node ('controller') failures one after the other (the third and any subsequent survival depends on free disk space), although the chances of more than one node failing are obviously very small.
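The degradation arithmetic, spelled out (illustrative, and assuming load is spread evenly across nodes, as the grid design intends):

```python
interface_nodes  = 6   # XIV I/O nodes
dual_controllers = 2   # traditional dual-controller system

print(f"XIV after one node failure: {1 - 1/interface_nodes:.0%} of performance")   # 83%
print(f"Dual-controller after one:  {1 - 1/dual_controllers:.0%} of performance")  # 50%
```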
What if you’re still worried?
Both sync and async (snapshot-based) replication are included in the XIV base code, i.e. no extra licence fees and no extra support fees: if you own an XIV, you own the replication licences. Replication is supported over both iSCSI and FC, so if you are really worried about some of your mission-critical data, replicate it, just as you would with a DMX or a DS8700.
I hope this helps explain how XIV's intelligent disk management protects against double disk failure in particular.
Next post I will look at power usage – another area that has been subject to competitor FUD.