Storage Spaghetti Anyone?

I recall Tom West (Chief Scientist at Data General, and star of Soul of a New Machine) once saying to me when he visited New Zealand that there was an old saying “Hardware lasts three years, Operating Systems last 20 years, but applications can go on forever.”

Over the years I have known many application developers and several development managers, and one thing that they seem to agree on is that it is almost impossible to maintain good code structure inside an app over a period of many years. The pressures of deadlines for features, changes in market, fashion and the way people use applications, the occasional weak programmer, and the occasional weak dev manager, or temporary lapse in discipline due to other pressures all contribute to fragmentation over time. It is generally by this slow attrition that apps end up being full of structural compromises and the occasional corner that is complete spaghetti.

I am sure there are exceptions, and there can be periodic rebuilds that improve things, but rebuilds are expensive.

If I think about the OS layer, I recall Data General rebuilding much of their DG/UX UNIX kernel to make it more structured because they considered the System V code to be pretty loose. Similarly IBM rebuilt UNIX into a more structured AIX kernel around the same time, and Digital UNIX (OSF/1) was also a rebuild based on Mach. Ironically HP-UX eventually won out over Digital UNIX after the HP/Compaq merger, with HP-UX rumoured to be the much less structured product, a choice that I’m told has slowed a lot of ongoing development. Microsoft rebuilt Windows as NT and Apple rebuilt Mac OS to base it on the Mach kernel.

So where am I heading with this?

Well, I have discussed this topic with a couple of people in recent times in relation to storage operating systems. If I line up some storage OSes and their approximate date of original release you’ll see what I mean:

Storage OS                             Original release   Age
NetApp Data ONTAP                      1992               22 years
EMC VNX / CLARiiON                     1993               21 years
IBM DS8000 (assuming ESS code base)    1999               15 years
HP 3PAR                                2002               12 years
IBM Storwize                           2003               11 years
IBM XIV / Nextra                       2006               8 years
Nimble Storage                         2010               4 years

I’m not trying to suggest that this is a line-up in reverse order of quality, and no doubt some vendors might claim rebuilds or superb structural discipline, but knowing what I know about software development, the age of the original code is certainly a point of interest.

With the current market disruption in storage, cost pressures are bound to take their toll on development quality, and the problem is amplified if vendors try to save money by out-sourcing development to non-integrated teams in low-cost countries (e.g. build your GUI in Romania, or your iSCSI module in India).

Spaghetti

IBM Software-defined Storage

The phrase ‘Software-defined Storage’ (SDS) has quickly become one of the most widely used marketing buzz terms in storage. It seems to have originated from Nicira’s use of the term ‘Software-defined Networking’ and was then adopted by VMware when they bought Nicira in 2012, where it evolved to become the ‘Software-defined Data Center’ including ‘Software-defined Storage’. VMware’s VSAN technology therefore has the top of mind position when we are talking about SDS. I really wish they’d called it something other than VSAN though, so as to avoid the clash with the ANSI T11 VSAN standard developed by Cisco.

I have seen IBM regularly use the term ‘Software-defined Storage’ to refer to:

  1. GPFS
  2. Storwize family (which would include FlashSystem V840)
  3. Virtual Storage Center / Tivoli Storage Productivity Center

I recently saw someone at IBM referring to FlashSystem 840 as SDS even though to my mind it is very much a hardware/firmware-defined ultra-low-latency system with a very thin layer of software so as to avoid adding latency.

Interestingly, IBM does not seem to market XIV as SDS, even though it is clearly a software solution running on commodity hardware that has been ‘applianced’ so as to maintain reliability and supportability.

Let’s take a quick look at the contenders:

1. GPFS: GPFS is a file system with a lot of storage features built in or added on, including de-clustered RAID, policy-based file tiering, snapshots, block replication, support for NAS protocols, WAN caching, continuous data protection, single namespace clustering, HSM integration, TSM backup integration, and even a nice new GUI. GPFS is the current basis for IBM’s NAS products (SONAS and V7000U) as well as the GSS (GPFS Storage Server), which is currently targeted at HPC markets but which I suspect is likely to re-emerge as a more broadly targeted product in 2015. I get the impression that GPFS may well be the basis of IBM’s SDS strategy going forward.

2. Storwize: The Storwize family is derived from IBM’s SAN Volume Controller technology and it has always been a software-defined product, but tightly integrated to hardware so as to control reliability and supportability. In the Storwize V7000U we see the coming together of Storwize and GPFS, and at some point IBM will need to make the call whether to stay with the DS8000-derived RAID that is in Storwize currently, or move to the GPFS-based de-clustered RAID. I’d be very surprised if GPFS hasn’t already won that long-term strategy argument.

3. Virtual Storage Center: The next contender in the great SDS shootout is IBM’s Virtual Storage Center and its sub-component Tivoli Storage Productivity Center. Within some parts of IBM, VSC is talked about as the key to SDS. VSC is edition dependent but usually includes the SAN Volume Controller / Storwize code developed by IBM Systems and Technology Group, as well as the TPC and FlashCopy Manager code developed by IBM Software Group, plus some additional TPC analytics and automation. VSC gives you a tremendous amount of functionality to manage a large complex site but it requires real commitment to secure that value. I think of VSC and XIV as the polar opposites of IBM’s storage product line, even though some will suggest you do both. XIV drives out complexity based on a kind of 80/20 rule and VSC is designed to let you manage and automate a complex environment.

Commodity Hardware: Many proponents of SDS will claim that it’s not really SDS unless it runs on pretty much any commodity server. GPFS and VSC qualify by this definition, but Storwize does not, unless you count the fact that SVC nodes are x3650 or x3550 servers. However, we are already seeing the rise of certified VMware VSAN-ready nodes as a way to control reliability and supportability, so perhaps we are heading for a happy medium between the two extremes of a traditional HCL menu and a fully buttoned down appliance.

Product Strategy: While IBM has been pretty clear in defining its focus markets – Cloud, Analytics, Mobile, Social, Security (the ‘CAMSS’ message that is repeatedly referred to inside IBM) I think it has been somewhat less clear in articulating a clear and consistent storage strategy, and I am finding that as the storage market matures, smart people are increasingly wanting to know what the vendors’ strategies are. I say vendors plural because I see the same lack of strategic clarity when I look at EMC and HP for example. That’s not to say the products aren’t good, or the roadmaps are wrong, but just that the long-term strategy is either not well defined or not clearly articulated.

It’s easier for new players and niche players of course, and VMware’s Software-defined Storage strategy, for example, is both well-defined and clearly articulated, which will inevitably make it a baseline for comparison with the strategies of the traditional storage vendors.

A/NZ STG Symposium: For the A/NZ audience, if you want to understand IBM’s SDS product strategy, the 2014 STG Tech Symposium in August is the perfect opportunity. Speakers include Sven Oehme from IBM Research who is deeply involved with GPFS development, Barry Whyte from IBM STG in Hursley who is deeply involved in Storwize development, and Dietmar Noll from IBM in Frankfurt who is deeply involved in the development of Virtual Storage Center.

Melbourne – August 19-22

Auckland – August 26-28

My name is Storage and I’ll be your Server tonight…

Ever since companies like Data General moved RAID control into an external disk sub-system back in the early ’90s it has been standard received knowledge that servers and storage should be separate.

While the capital cost of storage in the server is generally lower than for an external centralised storage subsystem, having storage as part of each server creates fragmentation and higher operational management overhead. Asset life-cycle management is also a consideration – servers typically last 3 years and storage can often be sweated for 5 years since the pace of storage technology change has traditionally been slower than for servers.

When you look at some common storage systems however, what you see is that they do include servers that have been ‘applianced’ i.e. closed off to general apps, so as to ensure reliability and supportability.

  • IBM DS8000 includes two POWER/AIX servers
  • IBM SAN Volume Controller includes two IBM System x x3650 Intel/Linux servers
  • IBM Storwize is a custom variant of the above SVC
  • IBM Storwize V7000U includes a pair of x3650 file heads running RHEL and Tivoli Storage Manager (TSM) clients and Space Management (HSM) clients
  • IBM GSS (GPFS Storage Server) also uses a pair of x3650 servers, running RHEL

At one point the DS8000 was available with LPAR separation into two storage servers (intended to cater to a split production/non-production environment) and there was talk at the time of the possibility of other apps such as TSM being able to be loaded onto an LPAR (a feature that was never released).

Apps or features?: There are a bunch of apps that could be run on storage systems, and in fact many already are, except they are usually called ‘features’ rather than apps. The clearest examples are probably in the NAS world, where TSM and Space Management and Samba/CTDB and Ganesha/NFS, and maybe LTFS, for example, could all be treated as features.

I also recall NetApp once talking about a Fujitsu-only implementation of ONTAP that could be run in a VM on a blade server, and EMC has talked up the possibility of running apps on storage.

GPFS: In my last post I illustrated an example of using IBM’s GPFS to construct a server-based shared storage system. The challenge with these kinds of systems is that they put the onus on the installer/administrator to get it right, rather than the traditional storage appliance approach where the vendor pre-constructs the system.

Virtualization: Reliability and supportability are vital, but virtualization does allow the possibility that we could have ring-fenced partitions for core storage functions and still provide server capacity for a range of other data-oriented functions e.g. MapReduce, Hadoop, OpenStack Cinder & Swift, as well as apps like TSM and HSM, and maybe even things like compression, dedup, anti-virus, LTFS etc., but treated not so much as storage system features, but more as genuine apps that you can buy from 3rd parties or write yourself, just as you would with traditional apps on servers.

The question is not so much ‘can this be done’, but more, ‘is it a good thing to do’? Would it be a good thing to open up storage systems and expose the fact that these are truly software-defined systems running on servers, or does that just make support harder and add no real value (apart from providing a new fashion to follow in a fashion-driven industry)? My guess is that there is a gradual path towards a happy medium to be explored here.

IBM’s Scale-out FlashSystem Solution

IBM’s Flash strategy is a two-pronged approach, targeting the two segments that IDC labels as:

  1. Absolute Performance Flash
  2. Enterprise Flash

Last week I outlined the new FlashSystem 840 and focused mainly on the Absolute Performance aspect. Absolute Performance for IBM means latencies down around 90 microseconds write and 135 microseconds read, whereas most Flash storage systems in the market are talking 500+ microseconds best case. I’m guessing that in the new world of I/O bound applications, having 3 or 4 times the latency overhead could be a real problem for those vendors at some stage.

This week however I’d like to focus on the Enterprise Flash market segment.

Enterprise Flash

When we and IDC talk about Enterprise we are more concerned with the software stack and how it is used to address issues of:

  • Scalability
  • Snapshots & Clones
  • Replication
  • Storage Efficiency
  • Interoperability

The short answer to all of these is IBM’s SAN Volume Controller. Folks who are not very familiar with SVC often assume that SVC adds latency to storage. In the case of spinning disk systems, my experience has been that SVC reduces latency (due to intelligent caching effects) but takes about 5% off the top of maximum native IOPS. In the real world that means that things will almost always go faster with SVC than without it.

Scale-out Flash Latency

In the case of Flash, the picture is slightly different. The latencies of the FlashSystem 840 are so low that SVC caching does not fully compensate for other effects, and the nett effect is that putting SVC in front of your FlashSystem 840 is likely to add around 100 microseconds of latency.

Yes that’s right, only 100 microseconds. I should add that I have not personally verified this, but have been told that is what we are seeing in IBM’s internal lab tests.

When you add 100 microseconds to the low latency of the FlashSystem 840 (90 microseconds write, 135 microseconds read) you still have numbers down below 250 microseconds, which is twice as fast as the numbers quoted on products like XtremIO and Violin 6200.
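The arithmetic behind that claim can be checked with a short sketch. It uses only the read-latency figures quoted in these posts (135 microseconds native, roughly 100 microseconds added by SVC, and the 500+ microsecond best case quoted for other systems); the function and constant names are made up for illustration.

```python
# Latency figures quoted in the text, in microseconds.
FLASH_840_READ_US = 135         # FlashSystem 840 read latency
SVC_OVERHEAD_US = 100           # added by SVC (lab figure, not verified by the author)
COMPETITOR_BEST_CASE_US = 500   # "500+ microseconds best case"


def with_svc(native_latency_us: float, overhead_us: float = SVC_OVERHEAD_US) -> float:
    """Latency seen by the host when SVC sits in front of the flash array."""
    return native_latency_us + overhead_us


read_via_svc = with_svc(FLASH_840_READ_US)
ratio = COMPETITOR_BEST_CASE_US / read_via_svc

print(f"Read latency via SVC: {read_via_svc} microseconds")
print(f"A 500 microsecond system is {ratio:.2f}x slower")
```

On these numbers the read path through SVC stays under 250 microseconds, and a 500 microsecond system comes out a little over twice as slow.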

Even way back in 2008 we announced a benchmark result of 1 million IOPS with SVC and Flash, code-named Quicksilver. At the time the IBM statement said that IBM was planning a complete end-to-end systems approach to Flash and…

“Performance improvements of this magnitude can have profound implications for business, allowing two to three times the work to [be completed] in a given time frame for . . . time-sensitive applications like reservations systems, and financial program trading systems, and creating opportunity for entirely new insights in information-warehouses and analytics solutions”

So this is not new for IBM. The recently announced FlashSystem Solution with SVC is the culmination of six years of preparation (including SVC tuning) by IBM.

Full Enterprise Software Function Set

So you can understand now why IBM does not need to reinvent a whole separate scale-out offering of the sort that Whiptail Invicta (Cisco’s new EMC killer) and XtremIO Cluster (EMC’s new fat-boy SSD system) have tried to create. IBM can deliver a much more mature and feature-rich solution with consistent management and feature functions right across the board from the small V3700 with Easy Tier Flash right through to high-end SVC Flash Solutions like the one implemented by Sprint in 2013.

An Elegant Scale-Out Flash Solution

SVC brings proven data center credentials to scale-out Flash, delivering the full Storwize software stack while adding as little as 100 microseconds of latency. That is a good story and one that will not be easily matched by any competitor, and if the market would prefer something that is more tightly coupled from a hardware point of view then I don’t see why IBM couldn’t also deliver that in future if it wanted to.

So IBM has avoided the need to reinvent, develop, or buy in a new immature scale-out mechanism for Flash. By using SVC you get FlashCopy snapshots and clones, as well as volume replication over IP, and Real-time Compression. But possibly most important of all is the full SVC interoperability matrix. How’s that for a software-defined storage strategy that delivers rapid time-to-value in exactly the way it’s meant to?

For more info you can check out the IBM FlashSystem product page and the IBM Redbook Solution Guide “Implementing FlashSystem 840 with SAN Volume Controller”.

IBM FlashSystem Solution

IBM FlashSystem 840 for Legacy-free Flash

Flash storage is at an interesting place and it’s worth taking the time to understand IBM’s new FlashSystem 840 and how it might be useful.

A traditional approach to flash is to treat it like a fast disk drive with a SAS interface, and assume that a faster version of traditional systems is the way of the future. This is not a bad idea, and with auto-tiering technologies this kind of approach was mastered by the big vendors some time ago, and can be seen for example in IBM’s Storwize family and DS8000, and as a cache layer in the XIV. Using auto-tiering we can perhaps expect large quantities of storage to deliver latencies around 5 milliseconds, rather than a more traditional 10 ms or higher (e.g. MS Exchange’s Jetstress test only fails when you get to 20 ms).


Some players want to use all SSDs in their disk systems, which you can do with Storwize for example, but this is again really just a variation on a fairly traditional approach and you’re generally looking at storage latencies down around one or two milliseconds. That sounds pretty good compared to 10 ms, but there are ways to do better and I suspect that SSD-based systems will not be where it’s at in 5 years’ time.

The IBM FlashSystem 840 is a little different and it uses flash chips, not SSDs. Its primary purpose is to be very very low latency. We’re talking as low as 90 microseconds write, and 135 microseconds read. This is not a traditional system with a soup-to-nuts software stack. FlashSystem has a new Storwize GUI, but it is stripped back to keep it simple and to avoid anything that would impact latency.

This extreme low latency is a unique IBM proposition, since it turns out that even when other vendors use MLC flash chips instead of SSDs, by their own admission they generally still end up with latency close to 1 ms, presumably because of their controller and code-path overheads.

FlashSystem 840

  • 2U appliance with hot swap modules, power and cooling, controllers etc.
  • Concurrent firmware upgrade and call-home support
  • Encryption is standard
  • Choice of 16G FC, 8G FC, 40G IB and 10G FCoE interfaces
  • Choice of upgradeable capacity
    Nett of 2-D RAID5    4 modules    8 modules    12 modules
    2 TB modules         4 TB         12 TB        20 TB
    4 TB modules         8 TB         24 TB        40 TB
  • Also a 2 TB starter option with RAID0
  • Each module has 10 flash chips and each chip has 16 planes
  • RAID5 is applied both across modules and within modules
  • Variable stripe RAID within modules is self-healing
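
The nett capacities in the ‘Nett of 2-D RAID5’ table above follow a simple pattern, sketched below. The sketch reads the module sizes as 2 TB and 4 TB, and the two-module overhead is only an inference from the quoted numbers (it would be consistent with one module’s worth of parity plus one spare); it is not something stated here by IBM. The function name is made up for illustration.

```python
def usable_tb(modules: int, module_tb: int, overhead_modules: int = 2) -> int:
    """Nett capacity if a fixed number of modules' worth of space is overhead.

    overhead_modules=2 is an assumption that happens to fit the quoted table.
    """
    return (modules - overhead_modules) * module_tb


# (module count, module size in TB) -> nett TB as quoted in the table
QUOTED = {
    (4, 2): 4, (8, 2): 12, (12, 2): 20,
    (4, 4): 8, (8, 4): 24, (12, 4): 40,
}

for (modules, size_tb), quoted_tb in QUOTED.items():
    modelled = usable_tb(modules, size_tb)
    print(f"{modules:2d} x {size_tb} TB: quoted {quoted_tb} TB, modelled {modelled} TB")
```

Every quoted figure matches the model, which also shows why larger configurations are more space-efficient: the overhead stays fixed while the module count grows.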

I’m thinking that prime targets for these systems include Databases and VDI, but also folks looking to future-proof their general performance. If you’re making a 5 year purchase, not everyone will want to buy a ‘mature’ SSD legacy-style flash solution, when they could instead buy into a disk-free architecture of the future.

But, as mentioned, FlashSystem does not have a full traditional software stack, so let’s consider the options if you need some of that stuff:

  • IMHO, when it comes to replication, databases are usually best replicated using log shipping, Oracle Data Guard etc.
  • VMware volumes can be replicated with native VMware server-based tools.
  • AIX volumes can be replicated using AIX Geographic Mirroring.
  • On AIX and some other systems you can use logical volume mirroring to set up a mirror of your volumes with preferred read set to the FlashSystem 840, and writes mirrored to a V7000 (or DS8000, XIV etc.), thereby allowing full software stack functions on the mirrored copy (on the V7000) without slowing down the reads off the FlashSystem.
  • You can also virtualize FlashSystem behind SVC or V7000
  • Consider using Tivoli Storage Manager dedup disk to disk to create a DR environment

Right now, FlashSystem 840 is mainly about screamingly low latency and high performance, with some reasonable data center class credentials, and all at a pretty good price. If you have a data warehouse, or a database that wants that kind of I/O performance, or a VDI implementation that you want to de-risk, or a general workload that you want to future-proof, then maybe you should talk to IBM about FlashSystem 840.

Meanwhile I suggest you check out these docs:

Another Storwize Global Mirror Best Practice Tip

Tip: When running production-style workloads alongside Global Mirror continuous replication secondary volumes on one Storwize system, best practice is to put the production and DR workloads into separate pools. This is especially important when the production workloads are write intensive.

Aside from write-intensive OLTP, OLAP etc, large file copies (e.g. zipping a 10GB flat file database export) can be the biggest hogs of write resource (cache and disk), especially where the backend disk is not write optimised (e.g. RAID6).

Write Cache Partitioning

Global Mirror continuous replication requires a fast clean path for writes at the target site. If it doesn’t get that, it places heavy demands on the write cache at the target site. If that write cache is already heavily committed, it creates back-pressure through Global Mirror all the way back to the source system. However, if you create more than one pool on your Storwize system it will manage quality of service for the write cache on a pool by pool basis:

Pools on your system    Max % of write cache any one pool can use
1                       100%
2                       66%
3                       40%
4                       30%
5                       25%
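
To see what those limits mean in practice, here is a minimal sketch that turns the percentages above into an absolute ceiling per pool. The 16 GiB write cache size is a hypothetical figure chosen for illustration, and the function name is made up; only the percentages come from the table.

```python
# Upper limit on the share of write cache any one pool may use,
# keyed by the number of pools on the system (percentages from the table).
MAX_WRITE_CACHE_PCT = {1: 100, 2: 66, 3: 40, 4: 30, 5: 25}


def max_pool_write_cache_gib(total_write_cache_gib: float, pool_count: int) -> float:
    """Largest amount of write cache a single pool could occupy."""
    if pool_count not in MAX_WRITE_CACHE_PCT:
        # The table lists only 1 to 5 pools, so nothing is guessed beyond that.
        raise ValueError(f"no limit listed for {pool_count} pools")
    return total_write_cache_gib * MAX_WRITE_CACHE_PCT[pool_count] / 100


# Hypothetical system with 16 GiB of write cache.
for pools in sorted(MAX_WRITE_CACHE_PCT):
    ceiling = max_pool_write_cache_gib(16, pools)
    print(f"{pools} pool(s): one pool can use at most {ceiling:.2f} GiB")
```

The point for Global Mirror is that once production and DR volumes sit in separate pools, a write-heavy production pool can no longer consume the whole write cache and starve the replication target.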

RAID6 for Write Intensive Workloads?

If you are thinking of using RAID6 in your Global Mirror continuous replication target pool, you might also want to consider instead using RAID10, or maybe using RAID6 with Easy Tier (SSD assist). As an example, consider how Disk Magic models the following two options with a 100% write workload (16KB I/O size):
  • 10 x 4TB NL-SAS 7200RPM RAID1 (nett 18TiB)
  • 22 x 1200GB SAS 10KRPM 9+2 RAID6 (nett 19TiB)

According to Disk Magic, not only is the RAID1 option much lower cost, but it is also ~10% faster. I’m not 100% sure we want to encourage folks to use 7200RPM drives at the Global Mirror target side, but the point I’m making is that RAID6 is not really ideal in a 100% write environment. Of course using Easy Tier (SSD assist) can help enormously [added 29th April 2014] in some situations, but not really with Global Mirror targets since the copy grain size is 256KiB and Easy Tier will ignore everything over 64KiB.
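
The nett capacities quoted for the two options can be reproduced with a quick calculation. This is only a capacity sketch (it says nothing about performance, which is what Disk Magic models), and the function names are made up for illustration.

```python
TB = 10**12    # decimal terabyte, as used for drive sizes
TIB = 2**40    # binary tebibyte, as used for the nett figures


def raid1_usable_tib(drives: int, drive_tb: float) -> float:
    """Mirrored pairs: half of the raw capacity is usable."""
    return (drives / 2) * drive_tb * TB / TIB


def raid6_usable_tib(drives: int, drive_tb: float, data: int, parity: int) -> float:
    """Whole arrays of data+parity drives; only the data drives count."""
    arrays = drives // (data + parity)
    return arrays * data * drive_tb * TB / TIB


nl_sas_raid1 = raid1_usable_tib(10, 4)        # 10 x 4TB NL-SAS, RAID1
sas_raid6 = raid6_usable_tib(22, 1.2, 9, 2)   # 22 x 1200GB SAS, 9+2 RAID6

print(f"RAID1 option: {nl_sas_raid1:.1f} TiB")
print(f"RAID6 option: {sas_raid6:.1f} TiB")
```

Both options land at roughly the same usable capacity (about 18 and 19 TiB once truncated), which is what makes them a fair like-for-like comparison.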

Global Mirror with Change Volumes

Global Mirror continuous replication is not synchronous, but typically runs at a lag of less than 100 ms. One way to avoid resource contention issues is to use Global Mirror with Change Volumes (snapshot-based replication) which severs the back-pressure link completely, leaving your production volumes to run free : )

Removing a managed disk non-disruptively from a pool

If however you find yourself in the position of having a workload issue on your Global Mirror target volumes and you want to keep using continuous replication, Storwize allows you to non-disruptively depopulate a managed disk (RAID set) from the pool (assuming you have enough free capacity) so you can create a separate pool from that mdisk.

IBM Storwize 7.2 wins by a SANSlide

So following my recent blog post on SANSlide WAN optimization appliances for use with Storwize replication, IBM has just announced Storwize 7.2 (available December) which includes not only replication natively over IP networks (licensed as Global Mirror/Metro Mirror) but also has SANSlide WAN optimization built in for free. That is, to get the benefits of WAN optimization you no longer need to purchase Riverbed or Cisco WAAS or SANSlide appliances.

Admittedly, Global Mirror was a little behind the times in getting to a native IP implementation, but having got there, the developers obviously decided they wanted to do it in style and take the lead in this space, by offering a more optimized WAN replication experience than any of our competitors.

The industry problem with TCP/IP latency is the time it takes to acknowledge that your packets have arrived at the other end. You can’t send the next set of packets until you get that acknowledgement back. So on a high latency network you end up spending a lot of your time waiting, which means you can’t take proper advantage of the available bandwidth. Effective bandwidth usage can sometimes be reduced to only 20% of the actual bandwidth you are paying for.

Round trip latency

The first time I heard this story was actually back in the mid-90’s from a telco network engineer. His presentation was entitled something like “How latency can steal your bandwidth”.

SANSlide mitigates latency by virtualising the pipe with many connections. While one connection is waiting for the ACK another is sending data. Using many connections, the pipe can often be filled to more than 95%.
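
The effect is easy to model. The sketch below is a simplified window-limited model, not a description of how SANSlide itself is implemented: each connection can have at most one window of data in flight per round trip, so its throughput is capped at window size divided by round-trip time, and several connections share the link until it is full. The link speed, round-trip time and window size are hypothetical values chosen for illustration.

```python
def effective_throughput_mbps(link_mbps: float, rtt_ms: float,
                              window_kib: float, connections: int = 1) -> float:
    """Throughput in Mbit/s of window-limited connections sharing one link."""
    # One connection can send at most one window per round trip.
    per_conn_mbps = (window_kib * 1024 * 8) / (rtt_ms / 1000) / 1_000_000
    # Parallel connections add up until the physical link is saturated.
    return min(link_mbps, connections * per_conn_mbps)


LINK_MBPS = 1000    # hypothetical 1 Gbit/s WAN link
RTT_MS = 80         # hypothetical round-trip time
WINDOW_KIB = 2048   # hypothetical 2 MiB in flight per connection

for n in (1, 2, 4, 8):
    mbps = effective_throughput_mbps(LINK_MBPS, RTT_MS, WINDOW_KIB, n)
    print(f"{n} connection(s): {mbps:.0f} Mbit/s ({mbps / LINK_MBPS:.0%} of link)")
```

With these made-up numbers a single connection uses only about a fifth of the link, while a handful of parallel connections is enough to fill it, which is the behaviour described above.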

SANSlide virtual links

If you have existing FCIP routers you don’t need to rush out and switch over to IP replication with SANSlide, especially if your latency is reasonably low, but if you do have a high latency network it would be worth discussing your options with your local IBM Storwize expert. It might depend on the sophistication of your installed FCIP routers. Brocade for example suggests that the IBM SAN06B-R is pretty good at WAN optimization. So the graph below does not necessarily apply to all FCIP routers.

SANSlide Throughput

When you next compare long distance IBM Storwize replication to our competitors’ offerings, you might want to ask them to include the cost of WAN optimization appliances to get a full apples for apples comparison, or you might want to take into account that with IBM Storwize you will probably need a lot less bandwidth to achieve the same RPO.

Even when others do include products like Riverbed appliances with their offerings, SANSlide still has the advantage of being completely data-agnostic, so it doesn’t get confused or slow down when transmitting encrypted or compressed data like most other WAN optimization appliances do.

Free embedded SANSlide is only one of the cool new things in the IBM Storwize world. The folks in Hursley have been very busy. Check out Barry Whyte’s blog entry and the IBM Storwize product page if you haven’t done so already.
