IBM GPFS – Software Defined Storage

GPFS (General Parallel File System) is one of those very cool technologies you can do so much with that it’s actually fun to design solutions with it (provided you’re the kind of person who also gets a kick out of a nice, elegant mathematical proof by induction).

Back in 2010 I was asked by an IBM systems software strategist for my opinion as to whether GPFS had potential as a mainstream product, or whether it was best kept as an underlying component in mainstream solutions. I was strongly in the component camp, but now I almost regret that, because it may be that the only thing really holding GPFS back was the lack of its own comprehensive GUI. That is something I still hope will be addressed in the not-too-distant future.

Anyway, this is a sample design that attempts to show some of the things you can do with GPFS by way of building a software defined storage and server environment.

The central box shows GPFS servers (virtualized in this example) and the left and right boxes show GPFS clients. GPFS also supports ILM policies between disk tiers and out to LTFS tape, as well as optional integration with HSM (via Tivoli Space Management) and fast efficient backup with Tivoli Storage Manager.

[Diagram: GPFS software defined storage]

There are of course a few caveats and restrictions. Check out the GPFS infocenter for the technical details.

This second diagram shows a simpler view of how to build a highly available software defined storage environment. The example shows two physical servers, but you can add many more servers and still have a single storage pool. Mirroring is on a per-volume basis. You could also use GPFS Native RAID to build, for example, a RAID6 array in each server.

[Diagram: VMware with GPFS]

Building Scale-out NAS with IBM Storwize V7000 Unified

If you need scalable NAS and you’re primarily interested in capacity scaling, with less emphasis on performance and more on a cost-effective entry price, then you might be interested in building a scale-out NAS from Storwize V7000 Unified systems, especially if you have some block I/O requirements to go with your NAS requirements.

There are three ways that I can think of doing this and each has its strengths. The documentation on these options is not always easy to find, so these diagrams and bullets might help to show what is possible.

One key point that is not well understood is that clustering V7000 systems to a V7000U allows SMB2 access to all of the capacity – a feature that you don’t get if you virtualize secondary systems rather than cluster them.

[Diagram: V7000U cluster]

[Diagram: V7000U virtualization]

[Diagram: V7000U Active Cloud]

And of course systems management is made relatively simple with the Storwize GUI.

[Image: V7000U GUI]

IBM’s Scale-out FlashSystem Solution

IBM’s Flash strategy is a two-pronged approach, targeting the two segments that IDC labels as:

  1. Absolute Performance Flash
  2. Enterprise Flash

Last week I outlined the new FlashSystem 840 and focused mainly on the Absolute Performance aspect. Absolute Performance for IBM means latencies down around 90 microseconds write and 135 microseconds read, whereas most Flash storage systems in the market are talking 500+ microseconds best case. I’m guessing that in the new world of I/O-bound applications, having three or four times the latency overhead could be a real problem for those vendors at some stage.

This week however I’d like to focus on the Enterprise Flash market segment.

Enterprise Flash

When we and IDC talk about Enterprise we are more concerned with the software stack and how it is used to address issues of:

  • Scalability
  • Snapshots & Clones
  • Replication
  • Storage Efficiency
  • Interoperability

The short answer to all of these is IBM’s SAN Volume Controller. Folks who are not very familiar with SVC often assume that SVC adds latency to storage. In the case of spinning disk systems, my experience has been that SVC reduces latency (due to intelligent caching effects) but takes about 5% off the top of maximum native IOPS. In the real world, that means things will almost always go faster with SVC than without it.

[Chart: scale-out Flash latency]

In the case of Flash, the picture is slightly different. The latencies of the FlashSystem 840 are so low that SVC caching does not fully compensate for other effects, and the nett result is that putting SVC in front of your FlashSystem 840 is likely to add around 100 microseconds of latency.

Yes that’s right, only 100 microseconds. I should add that I have not personally verified this, but have been told that is what we are seeing in IBM’s internal lab tests.

When you add 100 microseconds to the low latency of the FlashSystem 840 (90 microseconds write, 135 microseconds read) you still have numbers below 250 microseconds, which is around twice as fast as the numbers quoted for products like XtremIO and Violin 6200.
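As a quick back-of-the-envelope check on that claim (the latency figures are the ones quoted above; the ~100 microsecond SVC overhead is the lab-reported number, not something I have verified myself):

```python
# Back-of-the-envelope latency check using the figures quoted in this post.
flash840_write_us = 90   # FlashSystem 840 write latency (microseconds)
flash840_read_us = 135   # FlashSystem 840 read latency (microseconds)
svc_overhead_us = 100    # approximate latency added by SVC (lab-reported figure)

print(f"Write via SVC: {flash840_write_us + svc_overhead_us} us")  # 190 us
print(f"Read via SVC:  {flash840_read_us + svc_overhead_us} us")   # 235 us
# Both stay under 250 us, i.e. roughly half the 500+ us best case quoted for
# many competing Flash systems.
```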

Even way back in 2008 we announced a benchmark result of 1 million IOPS with SVC and Flash, code-named Quicksilver. At the time the IBM statement said that IBM was planning a complete end-to-end systems approach to Flash and…

“Performance improvements of this magnitude can have profound implications for business, allowing two to three times the work to [be completed] in a given time frame for . . . time-sensitive applications like reservations systems, and financial program trading systems, and creating opportunity for entirely new insights in information-warehouses and analytics solutions”

So this is not new for IBM. The recently announced FlashSystem Solution with SVC is the culmination of six years of preparation (including SVC tuning) by IBM.

Full Enterprise Software Function Set

So you can understand now why IBM does not need to reinvent a whole separate scale-out offering of the sort that Whiptail Invicta (Cisco’s new EMC killer) and XtremIO Cluster (EMC’s new fat-boy SSD system) have tried to create. IBM can deliver a much more mature and feature-rich solution with consistent management and feature functions right across the board from the small V3700 with Easy Tier Flash right through to high-end SVC Flash Solutions like the one implemented by Sprint in 2013.

An Elegant Scale-Out Flash Solution

SVC brings proven data center credentials to scale-out Flash, delivering the full Storwize software stack while adding as little as 100 microseconds of latency. That is a good story, and one that will not be easily matched by any competitor. And if the market would prefer something more tightly coupled from a hardware point of view, I don’t see why IBM couldn’t deliver that in future as well.

So IBM has avoided the need to reinvent, develop, or buy in a new, immature scale-out mechanism for Flash. By using SVC you get FlashCopy snapshots and clones, as well as volume replication over IP, and Real-time Compression. But possibly most important of all is the full SVC interoperability matrix. How’s that for a software defined storage strategy that delivers rapid time-to-value in exactly the way it’s meant to?

For more info you can check out the IBM FlashSystem product page and the IBM Redbooks Solution Guide “Implementing FlashSystem 840 with SAN Volume Controller”.

[Diagram: IBM FlashSystem solution]

IBM FlashSystem 840 for Legacy-free Flash

Flash storage is at an interesting place and it’s worth taking the time to understand IBM’s new FlashSystem 840 and how it might be useful.

A traditional approach to flash is to treat it like a fast disk drive with a SAS interface, and assume that a faster version of traditional systems is the way of the future. This is not a bad idea, and with auto-tiering technologies this kind of approach was mastered by the big vendors some time ago; it can be seen, for example, in IBM’s Storwize family and DS8000, and as a cache layer in the XIV. Using auto-tiering we can perhaps expect large quantities of storage to deliver latencies around 5 milliseconds, rather than a more traditional 10 ms or higher (e.g. MS Exchange’s Jetstress test only fails when you get to 20 ms).

[Image: No SSDs]

Some players want to use all SSDs in their disk systems, which you can do with Storwize for example, but this is again really just a variation on a fairly traditional approach, and you’re generally looking at storage latencies down around one or two milliseconds. That sounds pretty good compared to 10 ms, but there are ways to do better, and I suspect that SSD-based systems will not be where it’s at in five years’ time.

The IBM FlashSystem 840 is a little different: it uses flash chips, not SSDs. Its primary purpose is to be very, very low latency. We’re talking as low as 90 microseconds write and 135 microseconds read. This is not a traditional system with a soup-to-nuts software stack. FlashSystem has a new Storwize GUI, but it is stripped back to keep it simple and to avoid anything that would impact latency.

This extreme low latency is a unique IBM proposition, since it turns out that even when other vendors use MLC flash chips instead of SSDs, by their own admission they generally still end up with latency close to 1 ms, presumably because of their controller and code-path overheads.

FlashSystem 840

  • 2U appliance with hot-swap modules, power and cooling, controllers, etc.
  • Concurrent firmware upgrade and call-home support
  • Encryption is standard
  • Choice of 16Gb FC, 8Gb FC, 40Gb IB and 10Gb FCoE interfaces
  • Choice of upgradeable capacity, nett of 2-D RAID5 (see the sketch after this list):
      – 2TB modules: 4 TB with 4 modules, 12 TB with 8 modules, 20 TB with 12 modules
      – 4TB modules: 8 TB with 4 modules, 24 TB with 8 modules, 40 TB with 12 modules
  • Also a 2 TB starter option with RAID0
  • Each module has 10 flash chips and each chip has 16 planes
  • RAID5 is applied both across modules and within modules
  • Variable Stripe RAID within modules is self-healing
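As a rough sanity check on that capacity list, the published nett figures line up with about two modules’ worth of overhead for the 2-D RAID5 scheme. That accounting is my own observation rather than an official formula, but the arithmetic works out:

```python
# Rough sanity check on the FlashSystem 840 capacity bullets above.
# Assumption (mine, not an official formula): nett capacity behaves like
# (modules - 2) x module size, i.e. roughly two modules' worth of 2-D RAID5 overhead.
def nett_capacity_tb(module_tb: int, modules: int) -> int:
    """Approximate nett capacity after 2-D RAID5, in TB."""
    return (modules - 2) * module_tb

for module_tb in (2, 4):
    row = [nett_capacity_tb(module_tb, n) for n in (4, 8, 12)]
    print(f"{module_tb}TB modules with 4/8/12 modules installed: {row} TB")

# Prints [4, 12, 20] TB for 2TB modules and [8, 24, 40] TB for 4TB modules,
# matching the published capacity points.
```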

I’m thinking that prime targets for these systems include databases and VDI, but also folks looking to future-proof their general performance. If you’re making a 5-year purchase, not everyone will want to buy a ‘mature’ SSD legacy-style flash solution when they could instead buy into a disk-free architecture of the future.

But, as mentioned, FlashSystem does not have a full traditional software stack, so let’s consider the options if you need some of that stuff:

  • IMHO, when it comes to replication, databases are usually best replicated using log shipping, Oracle Data Guard etc.
  • VMware volumes can be replicated with native VMware server-based tools.
  • AIX volumes can be replicated using AIX Geographic Mirroring.
  • On AIX and some other systems you can use logical volume mirroring to set up a mirror of your volumes, with preferred read set to the FlashSystem 840 and writes mirrored to a V7000 (or DS8000, XIV, etc.), thereby allowing full software stack functions on the volumes (on the V7000) without slowing down reads from the FlashSystem.
  • You can also virtualize FlashSystem behind SVC or V7000.
  • Consider using Tivoli Storage Manager disk-to-disk backup with deduplication to create a DR environment.

Right now, FlashSystem 840 is mainly about screamingly low latency and high performance, with some reasonable data center class credentials, and all at a pretty good price. If you have a data warehouse, or a database that wants that kind of I/O performance, or a VDI implementation that you want to de-risk, or a general workload that you want to future-proof, then maybe you should talk to IBM about FlashSystem 840.

Meanwhile I suggest you check out the IBM FlashSystem 840 docs.

Another Storwize Global Mirror Best Practice Tip

Tip: When running production-style workloads alongside Global Mirror continuous replication secondary volumes on one Storwize system, best practice is to put the production and DR workloads into separate pools. This is especially important when the production workloads are write intensive.

Aside from write-intensive OLTP, OLAP etc, large file copies (e.g. zipping a 10GB flat file database export) can be the biggest hogs of write resource (cache and disk), especially where the backend disk is not write optimised (e.g. RAID6).

Write Cache Partitioning

Global Mirror continuous replication requires a fast, clean path for writes at the target site. If it doesn’t get that, it places heavy demands on the write cache at the target site, and if that write cache is already heavily committed, it creates back-pressure through Global Mirror to the source system. However, if you create more than one pool on your Storwize system, it will manage quality of service for the write cache on a pool-by-pool basis:

Pools on your system / max % of write cache any one pool can use:

  • 1 pool: 100%
  • 2 pools: 66%
  • 3 pools: 40%
  • 4 pools: 30%
  • 5 pools: 25%
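A minimal sketch of how that quality-of-service limit plays out for the tip above (the percentages come straight from the table; the helper function itself is just illustrative, not a Storwize API):

```python
# Illustrative only: per-pool write cache limits, straight from the table above.
WRITE_CACHE_LIMIT_PCT = {1: 100, 2: 66, 3: 40, 4: 30, 5: 25}

def max_write_cache_pct(pool_count: int) -> int:
    """Max percentage of write cache any single pool can consume (1 to 5 pools)."""
    return WRITE_CACHE_LIMIT_PCT[pool_count]

# With production and Global Mirror target volumes split into two pools, a
# write-hungry production workload can occupy at most 66% of the write cache,
# leaving headroom for the Global Mirror secondaries.
print(max_write_cache_pct(2))  # 66
```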

RAID6 for Write Intensive Workloads?

If you are thinking of using RAID6 in your Global Mirror continuous replication target pool, you might want to consider using RAID10 instead, or RAID6 with Easy Tier (SSD assist). As an example, Disk Magic compares the following two options under a 100% write workload (16KB I/O size):
  • 10 x 4TB NL-SAS 7200RPM RAID1 (nett 18TiB)
  • 22 x 1200GB SAS 10KRPM 9+2 RAID6 (nett 19TiB)

Not only is the RAID1 option much lower cost, but it is also ~10% faster. I’m not 100% sure we want to encourage folks to use 7200RPM drives at the Global Mirror target side, but the point I’m making is that RAID6 is not really ideal in a 100% write environment. Of course using Easy Tier (SSD assist) can help enormously.
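For anyone who wants to reproduce the capacity side of that comparison, and to see why a near-100% write workload punishes RAID6, here is a rough sketch. The drive counts and sizes come from the bullets above; the write-penalty factors are the standard rule-of-thumb values, and Disk Magic of course models far more than this:

```python
# Rough sketch of the RAID1 vs RAID6 comparison above. Drive counts and sizes
# come from the bullets; write penalties are the usual rule-of-thumb values.
TIB_PER_TB = 1000**4 / 1024**4  # ~0.909

def nett_tib(data_drives: int, drive_tb: float) -> float:
    """Usable capacity contributed by the data drives, in TiB."""
    return data_drives * drive_tb * TIB_PER_TB

# Option 1: 10 x 4TB NL-SAS in RAID1 (mirrored pairs -> 5 drives' worth of data)
raid1_tib = nett_tib(10 // 2, 4.0)   # ~18 TiB
# Option 2: 22 x 1200GB SAS in two 9+2 RAID6 arrays (2 x 9 drives' worth of data)
raid6_tib = nett_tib(2 * 9, 1.2)     # ~19.6 TiB
print(f"RAID1 nett ~{raid1_tib:.1f} TiB, RAID6 nett ~{raid6_tib:.1f} TiB")

# Write penalty: each 16KB host write costs ~2 backend I/Os on RAID1, but ~6 on
# RAID6 (read old data and both parity strips, then write all three back), which
# is why RAID6 is a poor fit when the workload is close to 100% write.
```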

Global Mirror with Change Volumes

Global Mirror continuous replication is not synchronous, but typically runs at a lag of less than 100 ms. One way to avoid resource contention issues is to use Global Mirror with Change Volumes (snapshot-based replication) which severs the back-pressure link completely, leaving your production volumes to run free : )

Removing a managed disk non-disruptively from a pool

If, however, you find yourself with a workload issue on your Global Mirror target volumes and you want to keep using continuous replication, Storwize allows you to non-disruptively depopulate a managed disk (RAID array) from the pool (assuming you have enough free capacity) so you can create a separate pool from that mdisk.

IBM Storwize 7.2 wins by a SANSlide

So, following my recent blog post on SANSlide WAN optimization appliances for use with Storwize replication, IBM has just announced Storwize 7.2 (available December), which includes not only native replication over IP networks (licensed as Global Mirror/Metro Mirror) but also SANSlide WAN optimization built in for free, i.e. to get the benefits of WAN optimization you no longer need to purchase Riverbed, Cisco WAAS, or SANSlide appliances.

Admittedly, Global Mirror was a little behind the times in getting to a native IP implementation, but having got there, the developers obviously decided they wanted to do it in style and take the lead in this space, by offering a more optimized WAN replication experience than any of our competitors.

The industry problem with TCP/IP latency is the time it takes to acknowledge that your packets have arrived at the other end. You can’t send the next set of packets until you get that acknowledgement back. So on a high latency network you end up spending a lot of your time waiting, which means you can’t take proper advantage of the available bandwidth. Effective bandwidth usage can sometimes be reduced to only 20% of the actual bandwidth you are paying for.
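To put some illustrative numbers on that, here is a quick sketch; the window size, round-trip time, and link speed are example values of my own, not measurements from any particular network:

```python
# Illustrative only: a single TCP stream is capped at roughly window / RTT,
# regardless of how fast the underlying link is. Example values are assumptions.
window_bytes = 64 * 1024      # a classic 64 KB TCP window
rtt_seconds = 0.040           # 40 ms round trip on a long-haul WAN link
link_mbps = 100               # the bandwidth you are actually paying for

max_stream_mbps = window_bytes * 8 / rtt_seconds / 1e6   # ~13 Mbit/s
utilisation = max_stream_mbps / link_mbps                # ~13%

print(f"One stream tops out at ~{max_stream_mbps:.0f} Mbit/s, "
      f"about {utilisation:.0%} of the {link_mbps} Mbit/s link")
```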

[Diagram: round-trip latency]

The first time I heard this story was actually back in the mid-’90s from a telco network engineer. His presentation was entitled something like “How latency can steal your bandwidth”.

SANSlide mitigates latency by virtualising the pipe with many connections. While one connection is waiting for its ACK, another is sending data. Using many connections, the pipe can often be filled to more than 95% of its capacity.
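Continuing the same example numbers, a back-of-the-envelope way to see why a handful of concurrent connections can fill the pipe (this is just the general principle, not a description of SANSlide internals):

```python
# Continuing the single-stream example above: estimate how many concurrent
# connections are needed to push link utilisation past 95%. The numbers are
# example assumptions; this shows the general principle, not SANSlide internals.
import math

link_mbps = 100
rtt_seconds = 0.040
window_bytes = 64 * 1024

per_stream_mbps = window_bytes * 8 / rtt_seconds / 1e6   # ~13 Mbit/s per stream
streams_needed = math.ceil(0.95 * link_mbps / per_stream_mbps)

print(f"~{streams_needed} concurrent streams fill more than 95% of the link")
```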

[Diagram: SANSlide virtual links]

If you have existing FCIP routers you don’t need to rush out and switch over to IP replication with SANSlide, especially if your latency is reasonably low, but if you do have a high-latency network it would be worth discussing your options with your local IBM Storwize expert. It might depend on the sophistication of your installed FCIP routers. Brocade, for example, suggests that the IBM SAN06B-R is pretty good at WAN optimization, so the graph below does not necessarily apply to all FCIP routers.

[Chart: SANSlide throughput]

When you next compare long distance IBM Storwize replication to our competitors’ offerings, you might want to ask them to include the cost of WAN optimization appliances to get a full apples-to-apples comparison, or you might want to take into account that with IBM Storwize you will probably need a lot less bandwidth to achieve the same RPO.

Even when others do include products like Riverbed appliances with their offerings, SANSlide still has the advantage of being completely data-agnostic, so it doesn’t get confused or slow down when transmitting encrypted or compressed data like most other WAN optimization appliances do.

Free embedded SANSlide is only one of the cool new things in the IBM Storwize world. The folks in Hursley have been very busy. Check out Barry Whyte’s blog entry and the IBM Storwize product page if you haven’t done so already.

SANSlide WAN Optimization Appliances

WAN optimization is not something that storage vendors traditionally put into their storage controllers. Storage replication traffic has to fend for itself out in the WAN world, and replication performance will usually suffer unless there are specific WAN optimization devices installed in the network.

For example, NetApp recommends Cisco WAAS as:

“an application acceleration and WAN optimization solution that allows storage managers to dramatically improve NetApp SnapMirror performance over the WAN.”

…because:

“…the rated throughput of high-bandwidth links cannot be fully utilized due to TCP behavior under conditions of high latency and high packet loss.”

EMC similarly endorses a range of WAN optimization products including those from Riverbed and Silver Peak.

Back in July, an IBM Redpaper entitled “IBM Storwize V7000 and SANSlide Implementation” slipped quietly onto the IBM Redbooks site. The Redpaper tells us that:

“this combination of SANSlide and the Storwize V7000 system provides a powerful solution for clients who require efficient, IP-based replication over long distances.”

Bridgeworks SANSlide provides WAN optimization, delivering much higher throughput on medium-to-high-latency IP networks. This graph is from the Redpaper:

[Chart: SANSlide throughput improvement]

Bridgeworks also advises that:

“On the commercial front the company is expanding its presence with OEM partners and building a network of distributors and value-added partners both in its home market and around the world.”

Anyone interested in replication using any of the Storwize family (including SVC) should probably check out the redpaper, even if only as a little background reading.
