What Happens When a Controller Fails

Whether XIV is a visionary way to reduce your storage TCO, or just a bizarre piece of foolishness as some bloggers would have you believe, is being tested daily. Every day that passes with large customers enjoying XIV's ease of use, performance and reliability is another vote for it being a visionary product.

XIV has been criticised because there is a chance that it might break, just like every other storage system that has ever been invented, yet because XIV is a little different, the nay-sayers somehow feel that non-perfection is a sin.

So let’s talk about non-perfection in an old-style storage architecture.

This morning I read on searchstorage <thanks guys> [link tweeted by jmrckins] of a disk system failure (let's keep this civilised by not specifying the vendor – besides, it's not about the vendor, it's about the fact that none of us are perfect) caused by “a unique bug in the specific version of firmware on the system”, which resulted in a major outage on April 16-17. An email outsourcing company was forced to refund charges to customers due to loss of service. The COO of said company sent a letter of explanation to customers and also posted information on the company blog about the failure, as follows:

“At approximately 6:15 a.m. PT on Thursday 4/16, a hardware failure occurred on one of the <named vendor> storage area networks (SANs) located in <provider’s> New Jersey data center. The service processor for one of the controller nodes had a failure. This failure caused the entire load for that SAN to be shifted to the service processor on the redundant controller node.”

The spare capacity on the single service processor was not enough to handle the entire load of all systems connected to the SAN, which caused a degradation of performance for the reading and writing of data to the SAN. The degradation of performance on the SAN in turn impacted the overall system’s ability to process email messages, creating a queuing of several hundred thousand messages within the system. The backlog was large enough that it took 32 hours for it to clear after the original event. At approximately 2 p.m. PT on Friday 4/17, all systems were functioning normally and mail delivery was considered to be “real-time.”

Bingo. It reads like a page out of an XIV brochure about the advantages of grid systems over dual controller systems. When one controller breaks on a busy system, you can suddenly be running at half performance, whereas on XIV you retain at least 84% of your performance.
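
To see roughly where those numbers come from, here is a back-of-the-envelope sketch in Python. It assumes aggregate performance scales more or less linearly with the number of surviving nodes, and the six-node grid count is purely illustrative rather than a vendor specification (it just happens to land close to the 84% figure above):

    # Rough sketch: fraction of performance retained after losing one node,
    # assuming throughput scales roughly linearly with surviving nodes.
    # Node counts here are illustrative assumptions, not vendor specifications.

    def retained_performance(total_nodes: int, failed_nodes: int = 1) -> float:
        """Fraction of aggregate performance left after failed_nodes fail."""
        surviving = total_nodes - failed_nodes
        return surviving / total_nodes

    # Classic dual controller array: lose one of two controllers.
    print(f"Dual controller: {retained_performance(2):.0%}")   # 50%

    # Grid-style array: lose one of, say, six interface modules.
    print(f"Six-node grid:   {retained_performance(6):.0%}")   # ~83%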

Let’s read on…

“Over the next several weeks, we will be taking additional corrective actions to make certain that there is enough spare capacity on the SAN to guarantee that it performs without performance degradation in the case of a single hardware failure. An additional SAN is being installed this week and starting as early as this weekend we will begin to migrate a portion of the existing systems to the new SAN.”

So the message here is that if you want your old-style dual controller disk system to handle a failure, you need to size it double in the first place. That is typically a very expensive step to take. So where are the bloggers now, so concerned about your access to data, crying from the rooftops about how it’s unsafe to install a dual controller system that doesn’t have 100% controller headroom?
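
Turned around, the same arithmetic shows how much raw performance you have to buy up front to keep that headroom. Again, this is only a sketch with illustrative node counts, assuming performance scales linearly with node count:

    # Sketch of the sizing argument: to absorb one failed node without
    # degradation, you must buy nodes / (nodes - 1) times your steady workload.
    # The node counts below are illustrative, not vendor specifications.

    def capacity_multiple(nodes: int, tolerated_failures: int = 1) -> float:
        """Aggregate performance to buy, as a multiple of the steady workload."""
        surviving = nodes - tolerated_failures
        return nodes / surviving

    print(f"Dual controller: buy {capacity_multiple(2):.2f}x the workload")   # 2.00x
    print(f"15-node grid:    buy {capacity_multiple(15):.2f}x the workload")  # ~1.07x

In other words, the dual controller box has to be sized at double the steady workload, while a larger grid only needs a few percent of spare headroom to ride through the same single failure.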

The COO continues…

“Additionally, we have engaged our SAN vendor to review the performance tuning of our SAN and implement adjustments to increase its overall performance capabilities. These events in tandem will guarantee that the SAN will be able to perform without an impact to the service in the event we experience another individual hardware error.”

What we see here is a reminder that, out in the wild, systems are often badly tuned; that’s why self-tuning systems like XIV are becoming increasingly important.

Now we all know that controller failures don’t happen very often, right? But hang on… Searchstorage discovered that the CTO of one of this email provider’s customers had previously written a blog entry in March, describing a similar outage only a month or two earlier, which says…

“Today I received their formal RFO (Reasons for Outage) letter via email which goes into great details describing why this outage occurred and what steps they are taking to try to prevent a re-occurrence for the same reasons in future. In a nutshell, there was a hardware failure in one of their <vendor> SAN devices, and this failure occurred in such a way that prevented the device’s own in-built fault tolerance mechanisms from allowing the SAN to effectively remain “up” – that is, they are saying this is one of those failures that should not have happened. These devices are designed precisely NOT to fail under such circumstances, but nonetheless it did fail.”

This highlights a few points:

1. Disk systems are never 100% safe

2. Given that nothing is certain, replication is a good thing (and on XIV it’s ‘free’)

3. Bloggers who attack other vendors’ products for not being perfect, while ignoring the weaknesses in their own, have drunk way too much of their company’s kool-aid.

3 Responses

  1. We agree on one thing: the XIV is bizarre. As to the conclusions:

    1. Disk systems are never 100% safe

    What you mean is: disk systems are never 100% reliable. There was no indication of data loss, which is my big issue with the XIV.

    2. Given that nothing is certain, replication is a good thing (and on XIV it’s ‘free’)

    I’m tempted to ask you for my free XIV. Please, don’t use the word unless you really, really mean it and you’re willing to do free. Which, of course, you aren’t.

    3. Bloggers who attack other vendors’ products for not being perfect, while ignoring the weaknesses in their own, have drunk way too much of their company’s kool-aid.

    Ah, shravaka, you have much to learn. “So where are the bloggers now, so concerned about your access to data, crying from the rooftops about how it’s unsafe to install a dual controller system that doesn’t have 100% controller headroom?” They’re off writing on other more interesting subjects, because to be frank, this kind of sizing is covered in Capacity Planning 101, and all you’re doing is publicly calling out & embarrassing the customer, not the vendor.

    • Alex, in response to your comments…

      Number 3:

      “this kind of sizing is covered in Capacity Planning 101, and all you’re doing is publicly calling out & embarrassing the customer, not the vendor.”

      Nice deflection; it reminds me of my 14-year-old son. Let’s hold ourselves accountable, as the professionals we claim to be, and stick to the issues. You had an outage that affected your customer. End of story.

      I am quite certain, based on their public statements, that your customer does not agree with your comments; see their public interview, which was published before this post.

      Number 2:

      XIV is not free, but the replication software is included (for no additional charge; some might consider that free, as long as you have a maintenance contract).

      Number 1:

      Help me understand how the data loss from a double drive failure on a 7+1 RAID 5 array (or whatever you want, besides RAID 6) compares with XIV. How do you determine what data loss is acceptable, and what isn’t? And no, we don’t lose the whole array. To boot, we can tell you which sectors to recover and how to do that, versus “you need to recover all data on the 7+1 AG”.

      The problem described here is a problem with ANY dual controller device (IBM, HDS, NTAP, EMC, etc). How can you deny this, while still applying clean logic?

      I look forward to your technology based response.

  2. I think this kind of thing happens; no-one is perfect.
    I love the concept of the DS8x00 because of its simplicity and reliability, and I think it used to be criticized precisely for that simplicity.
    Another 3 letter vendor with a 3 letter + 1 number product used to sell it based on system complexity, let’s say: the only product with so much stuff in it. And it’s true, it sounds awesome, but that kind of complexity is service-dependent.
    I’ve seen two financial market companies suffer SAN outages because of a misconfiguration in a complex port-mapping schema.
    Back to the DS8x00: I’ve seen one of these machines fail because of bad firmware, and the system was out of service for about 6 months until someone released a patch to resolve the incompatibility between the parts of the solution.
    Storage is a solution, not a system. There’s planning, testing, design and equipment, and sometimes one of these components fails.
    What I’m starting to appreciate about XIV is the performance consistency and simplicity. Given that not every server gets the full performance of the storage device unless you use wide striping, it sounds fantastic to have 17K IOPS on almost every server.
    But… all eggs in the same nest? And these are mechanical eggs? That sounds really scary, because you plan against this kind of thing. And dispersed storage with 2 copies of each chunk is not RAID-1.
    I think XIV is different: a different way of thinking about, planning and designing a solution and DR, period.
    What happened with this 3 letter company… business as usual.
