Whether XIV is a visionary way to reduce your storage TCO, or just a bizarre piece of foolishness as some bloggers would have you believe, is being tested daily. Every day that passes with large customers enjoying the freedom of XIV's ease of use, performance and reliability is another vote for it being a visionary product.
XIV has been criticised because there is a chance it might break, just like every other storage system ever invented, yet because XIV is a little different, the naysayers somehow feel that non-perfection is a sin.
So let’s talk about non-perfection in an old-style storage architecture.
This morning I read on SearchStorage <thanks guys> [link tweeted by jmrckins] of a disk system failure (let's keep this civilised by not specifying the vendor – besides, it's not about the vendor, it's about the fact that none of us are perfect) caused by “a unique bug in the specific version of firmware on the system”, which resulted in a major outage on April 16-17. An email outsourcing company was forced to refund charges to customers due to the loss of service. The COO of that company sent a letter of explanation to customers and also posted information about the failure on the company blog. As follows:
“At approximately 6:15 a.m. PT on Thursday 4/16, a hardware failure occurred on one of the <named vendor> storage area networks (SANs) located in <provider’s> New Jersey data center. The service processor for one of the controller nodes had a failure. This failure caused the entire load for that SAN to be shifted to the service processor on the redundant controller node.”
“The spare capacity on the single service processor was not enough to handle the entire load of all systems connected to the SAN, which caused a degradation of performance for the reading and writing of data to the SAN. The degradation of performance on the SAN in turn impacted the overall system’s ability to process email messages, creating a queuing of several hundred thousand messages within the system. The backlog was large enough that it took 32 hours for it to clear after the original event. At approximately 2 p.m. PT on Friday 4/17, all systems were functioning normally and mail delivery was considered to be “real-time.””
Bingo. It reads like a piece off an XIV brochure about the advantages of grid systems over dual controller systems. When one controller breaks on a busy system, you can suddenly be at half performance. Whereas on XIV you retain at least 84% performance.
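The arithmetic behind that comparison is simple enough to sketch. As a rough model (assuming load is spread evenly across identical controllers or modules, and a failed unit's share of the work is lost), the fraction of performance you keep after a failure is just the fraction of units still standing; the module counts below are illustrative, not exact XIV figures:

```python
def retained_performance(total_nodes: int, failed_nodes: int = 1) -> float:
    """Fraction of aggregate performance remaining after failures,
    assuming load is spread evenly across identical nodes."""
    if failed_nodes >= total_nodes:
        return 0.0
    return (total_nodes - failed_nodes) / total_nodes

# Dual-controller system: losing one of two controllers halves performance.
print(retained_performance(2))                 # 0.5

# A many-module grid loses only one module's share, e.g. 1/15:
print(round(retained_performance(15), 3))      # 0.933
```

The exact percentage a grid retains depends on how many modules share the load, but the shape of the argument is the same: 1/N lost instead of 1/2.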
Let’s read on…
“Over the next several weeks, we will be taking additional corrective actions to make certain that there is enough spare capacity on the SAN to guarantee that it performs without performance degradation in the case of a single hardware failure. An additional SAN is being installed this week and starting as early as this weekend we will begin to migrate a portion of the existing systems to the new SAN.”
So the message here is that if you want your old-style dual-controller disk system to handle a failure, you need to size it at double capacity in the first place. That is typically a very expensive step to take. So where are the bloggers now, so concerned about your access to data, crying from the rooftops about how it’s unsafe to install a dual controller system that doesn’t have 100% controller headroom?
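Put another way, the overprovisioning multiplier you need so the survivors can absorb a failed unit's load follows directly from the unit count. A minimal sketch of that sizing rule (under the same even-load assumption as before, with illustrative node counts):

```python
def required_overprovision(total_nodes: int, failures_tolerated: int = 1) -> float:
    """Capacity multiplier needed so surviving nodes can carry the
    full load after the given number of node failures."""
    surviving = total_nodes - failures_tolerated
    if surviving <= 0:
        raise ValueError("cannot tolerate that many failures")
    return total_nodes / surviving

# Dual controller: to ride through one failure you must buy 2x capacity.
print(required_overprovision(2))               # 2.0

# A 15-node grid needs only ~7% headroom for the same guarantee.
print(round(required_overprovision(15), 2))    # 1.07
```

That 2x-versus-1.07x gap is the cost argument in a nutshell: the headroom a dual-controller design demands for no-degradation failover is exactly the "size it double" expense described above.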
The COO continues…
“Additionally, we have engaged our SAN vendor to review the performance tuning of our SAN and implement adjustments to increase its overall performance capabilities. These events in tandem will guarantee that the SAN will be able to perform without an impact to the service in the event we experience another individual hardware error.”
What we see here is a reminder that, out in the wild, systems are badly tuned; that’s why self-tuning systems like XIV are becoming increasingly important.
Now, we all know that controller failures don’t happen very often, right? But hang on… SearchStorage discovered that a CTO at one of this email provider’s customer companies had previously written a blog entry in March, describing a similar outage only a month or two earlier, which says…
“Today I received their formal RFO (Reasons for Outage) letter via email which goes into great details describing why this outage occurred and what steps they are taking to try to prevent a re-occurrence for the same reasons in future. In a nutshell, there was a hardware failure in one of their <vendor> SAN devices, and this failure occurred in such a way that prevented the device’s own in-built fault tolerance mechanisms from allowing the SAN to effectively remain “up” – that is, they are saying this is one of those failures that should not have happened. These devices are designed precisely NOT to fail under such circumstances, but nonetheless it did fail.”
This highlights a few points:
1. Disk systems are never 100% safe
2. Given that nothing is certain, replication is a good thing (and on XIV it’s ‘free’)
3. Bloggers who attack other vendors’ products for not being perfect, while ignoring the weaknesses in their own, have drunk way too much of their company’s kool-aid.