My name is Storage and I’ll be your Server tonight…

Ever since companies like Data General moved RAID control into an external disk sub-system back in the early ’90s it has been standard received knowledge that servers and storage should be separate.

While the capital cost of storage in the server is generally lower than for an external centralised storage subsystem, having storage as part of each server creates fragmentation and higher operational management overhead. Asset life-cycle management is also a consideration – servers typically last 3 years and storage can often be sweated for 5 years since the pace of storage technology change has traditionally been slower than for servers.

When you look at some common storage systems however, what you see is that they do include servers that have been ‘applianced’ i.e. closed off to general apps, so as to ensure reliability and supportability.

  • IBM DS8000 includes two POWER/AIX servers
  • IBM SAN Volume Controller includes two IBM SystemX x3650 Intel/Linux servers
  • IBM Storwize is a custom variant of the above SVC
  • IBM Storwize V7000U includes a pair of x3650 file heads running RHEL and Tivoli Storage Manager (TSM) clients and Space Management (HSM) clients
  • IBM GSS (GPFS Storage Server) also uses a pair of x3650 servers, running RHEL

At one point the DS8000 was available with LPAR separation into two storage servers (intended to cater to a split production/non-production environment) and there was talk at the time of the possibility of other apps such as TSM being able to be loaded onto an LPAR (a feature that was never released).

Apps or features?: There are a bunch of apps that could be run on storage systems, and in fact many already are, except they are usually called ‘features’ rather than apps. The clearest examples are probably in the NAS world, where TSM and Space Management and SAMBA/CTDB and Ganesha/NFS, and maybe LTFS, for example, could all be treated as features.

I also recall Netapp once talking about a Fujitsu-only implementation of ONTAP that could be run in a VM on a blade server, and EMC has talked up the possibility of running apps on storage.

GPFS: In my last post I illustrated an example of using IBM’s GPFS to construct a server-based shared storage system. The challenge with these kinds of systems is that they put onus onto the installer/administrator to get it right, rather than the traditional storage appliance approach where the vendor pre-constructs the system.

Virtualization: Reliability and supportability are vital, but virtualization does allow the possibility that we could have ring-fenced partitions for core storage functions and still provide server capacity for a range of other data-oriented functions e.g. MapReduce, Hadoop, OpenStack Cinder & Swift, as well as apps like TSM and HSM, and maybe even things like compression, dedup, anti-virus, LTFS etc., but treated not so much as storage system features, but more as genuine apps that you can buy from 3rd parties or write yourself, just as you would with traditional apps on servers.

The question is not so much ‘can this be done’, but more, ‘is it a good thing to do’? Would it be a good thing to open up storage systems and expose the fact that these are truly software-defined systems running on servers, or does that just make support harder and add no real value (apart from providing a new fashion to follow in a fashion-driven industry)? My guess is that there is a gradual path towards a happy medium to be explored here.

IBM FlashSystem 840 for Legacy-free Flash

Flash storage is at an interesting place and it’s worth taking the time to understand IBM’s new FlashSystem 840 and how it might be useful.

A traditional approach to flash is to treat it like a fast disk drive with a SAS interface, and assume that a faster version of traditional systems are the way of the future. This is not a bad idea, and with auto-tiering technologies this kind of approach was mastered by the big vendors some time ago, and can be seen for example in IBM’s Storwize family and DS8000, and as a cache layer in the XIV. Using auto-tiering we can perhaps expect large quantities of storage to deliver latencies around 5 millseconds, rather than a more traditional 10 ms or higher (e.g. MS Exchange’s jetstress test only fails when you get to 20 ms).

No SSDs 3

Some players want to use all SSDs in their disk systems, which you can do with Storwize for example, but this is again really just a variation on a fairly traditional approach and you’re generally looking at storage latencies down around one or two millseconds. That sounds pretty good compared to 10 ms, but there are ways to do better and I suspect that SSD-based systems will not be where it’s at in 5 years time.

The IBM FlashSystem 840 is a little different and it uses flash chips, not SSDs. It’s primary purpose is to be very very low latency. We’re talking as low as 90 microseconds write, and 135 microseconds read. This is not a traditional system with a soup-to-nuts software stack. FlashSystem has a new Storwize GUI, but it is stripped back to keep it simple and to avoid anything that would impact latency.

This extreme low latency is a unique IBM proposition, since it turns out that even when other vendors use MLC flash chips instead of SSDs, by their own admission they generally still end up with latency close to 1 ms, presumably because of their controller and code-path overheads.

FlashSystem 840

  • 2u appliance with hot swap modules, power and cooling, controllers etc
  • Concurrent firmware upgrade and call-home support
  • Encryption is standard
  • Choice of 16G FC, 8G FC, 40G IB and 10G FCoE interfaces
  • Choice of upgradeable capacity
Nett of 2-D RAID5 4 modules 8 modules 12 modules
2GB modules 4 TB 12 TB 20 TB
4GB modules 8 TB 24 TB 40 TB
  • Also a 2 TB starter option with RAID0
  • Each module has 10 flash chips and each chip has 16 planes
  • RAID5 is applied both across modules and within modules
  • Variable stripe RAID within modules is self-healing

I’m thinking that prime targets for these systems include Databases and VDI, but also folks looking to future-proof their general performance. If you’re making a 5 year purchase, not everyone will want to buy a ‘mature’ SSD legacy-style flash solution, when they could instead buy into a disk-free architecture of the future.

But, as mentioned, FlashSystem does not have a full traditional software stack, so let’s consider the options if you need some of that stuff:

  • IMHO, when it comes to replication, databases are usually best replicated using log shipping, Oracle Data Guard etc.
  • VMware volumes can be replicated with native VMware server-based tools.
  • AIX volumes can be replicated using AIX Geographic Mirroring.
  • On AIX and some other systems you can use logical volume mirroring to set up a mirror of your volumes with preferred read set to the FlashSystem 840, and writes mirrored to a V7000 or (DS8000 or XIV etc), thereby allowing full software stack functions on the volumes (on the V7000) without slowing down the reads off the FlashSystem.
  • You can also virtualize FlashSystem behind SVC or V7000
  • Consider using Tivoli Storage Manager dedup disk to disk to create a DR environment

Right now, FlashSystem 840 is mainly about screamingly low latency and high performance, with some reasonable data center class credentials, and all at a pretty good price. If you have a data warehouse, or a database that wants that kind of I/O performance, or a VDI implementation that you want to de-risk, or a general workload that you want to future-proof, then maybe you should talk to IBM about FlashSystem 840.

Meanwhile I suggest you check out these docs:

IBM XIV Gen3 and SPC-1

IBM has just published an SPC-1 benchmark result for XIV. The magic number is 180,020 low latency IOPS in a single rack. This part of my blog post was delayed by my waiting for the official SPC-1 published document so I could focus in on an aspect of SPC-1 that I find particularly interesting.

XIV has always been a work horse rather than a race horse, being fast enough, and beating other traditional systems by never going out of tune, but 180,020 is still a lot of IOPS in a single rack.

SPC-1 has been criticised occasionally as being a drive-centric benchmark, but it’s actually more true to observe that many modern disk systems are drive-centric (XIV is obviously not one of those). Things do change and there was a time in the early 2000’s when, as I recall, most disk systems were controller-bound, and as systems continue to evolve I would expect SPC-1 to continue to expose some architectural quirks, and some vendors will continue to avoid SPC-1 so that their quirks are not exposed.

For example, as some vendors try to scale their architectures, keeping latency low becomes a challenge, and SPC-1 reports give us a lot more detail than just the topline IOPS number if we care to look.

The SPC-1 rules allow average response times up to 30 milliseconds, but generally I would plan real-world solutions around an upper limit of 10 milliseconds average, and for tier1 systems you might sometimes even want to design for 5 milliseconds.

I find read latency interesting because not only does SPC-1 allow for a lot of variance, but different architectures do seem to give very different results. Write latency on the other hand seems to stay universally low right up until the end. Let’s use the SPC-1 reports to look at how some of these systems stack up to my 5 millisecond average read latency test:

DS8870 – this is my baseline as a low-latency, high-performance system

  • 1,536 x 15KRPM drives RAID10 in four frames
  • 451,000 SPC-1 IOPS
  • Read latency hits 4.27 milliseconds at 361,000 IOPS

HP 3PAR V800

  • 1,920 x 15KRPM drives RAID10 in seven frames [sorry for reporting this initially as 3,840 – I was counting the drives and also the drive support package for the same number of drives]
  • 450,000 SPC-1 IOPS
  • Average read latency hits 4.23 millsconds at only 45,000 IOPS

Pausing for a moment to compare DS8870 with 3PAR V800 you’d have to say DS8870 is clearly in a different class when it comes to read latency.

Hitachi VSP

  • 1,152 x 15KRPM drives RAID10 in four frames
  • 270,000 SPC-1 IOPS
  • Average read latency hits 3.76 ms at only 27,000 IOPS and is well above 5 ms at 135,000

Hitachi HUS-VM

  • 608 x 15KRPM drives RAID10 in two frames
  • 181,000 SPC-1 IOPS
  • Average read latency hits 3.72 ms at only 91,000 IOPS and is above 5 ms at 145,000

Netapp FAS3270A

  • 2 x 512GB Flash Cache
  • 120 x 15KRPM drives RAID-DP in a single frame
  • 68,034 SPC-1 IOPS
  • Average read latency hits 2.73 ms at 34,000 IOPS and is well over 6 ms at 54,000

So how does XIV stack up?

  • 15 x 400GB Flash Cache
  • 180 x 7200RPM drives RAID-X in a single frame
  • 180,020 SPC-1 IOPS
  • Average read latency hits 4.08 millseconds at 144,000 IOPS

And while I know that there are many ways to analyse and measure the value of things, it is interesting that the two large IBM disk systems seem to be the only ones that can keep read latency down below 5 ms when they are heavily loaded.

[SPC-1 capacity data removed on 130612 as it wasn’t adding anything, just clutter]

Update 130617: I have just found another comment from HP in my spam filter, pointing out that the DS8870 had 1,536 drives not 1,296. I will have to remember not to write in a such a rush next time. This post was really just an add-on to the more important  first half of the post on the new XIV features, and was intended to celebrate the long-awaited SPC-1 result from the XIV team.

What do you get at an IBM Systems Technical Symposium?

What do you get at an IBM Systems Technical Symposium? Well for the event in Auckland, New Zealand November 13-15 I’ve tried to make the storage content as interesting as possible. If you’re interested in attending, send me an email at jkelly@nz.ibm.com and I will put you in contact with Jacell who can help you get registered. There is of course content from our server teams as well, but my focus has been on the storage content, planned as follows:

Erik Eyberg, who has just joined IBM in Houston from Texas Memory Systems following IBM’s recent acquisition of TMS, will be presenting “RAMSAN – The World’s Fastest Storage”. Where does IBM see RAMSAN fitting in and what is the future of flash? Check out RAMSAN on the web, on twitter, on facebook and on youtube.

Fresh from IBM Portugal and recently transferred to IBM Auckland we also welcome Joao Almeida who will deliver a topic that is sure to be one of the highlights, but unfortunately I can’t tell you what it is since the product hasn’t been announced yet (although if you click here you might get a clue).

Zivan Ori, head of XIV software development in Israel knows XIV at a very detailed level – possibly better than anyone, so come along and bring all your hardest questions! He will be here and presenting on:

  • XIV Performance – What you need to know
  • Looking Beyond the XIV GUI

John Sing will be flying in from IBM San Jose to demonstrate his versatility and expertise in all things to do with Business Continuance, presenting on:

  • Big Data – Get IBM’s take on where Big Data is heading and the challenges it presents and also how some of IBM’s products are designed to meet that challenge.
  • ProtecTIER Dedup VTL options, sizing and replication
  • Active/Active datacentres with SAN Volume Controller Stretched Cluster
  • Storwize V7000U/SONAS Global Active Cloud Engine multi-site file caching and replication

Andrew Martin will come in from IBM’s Hursley development labs to give you the inside details you need on three very topical areas:

  • Storwize V7000 performance
  • Storwize V7000 & SVC 6.4 Real-time Compression
  • Storwize V7000 & SVC Thin Provisioning

Senaka Meegama will be arriving from Sydney with three hot topics around VMware and FCoE:

  • Implementing SVC & Storwize V7000 in a VMware Environment
  • Implementing XIV in a VMware Environment
  • FCoE Network Design with IBM System Storage

Jacques Butcher is also coming over from Australia to provide the technical details you all crave on Tivoli storage management:

  • Tivoli FlashCopy Manager 3.2 including Vmware Integration
  • TSM for Virtual Environments 6.4
  • TSM 6.4 Introduction and Update plus TSM Roadmap for 2013

Maurice McCullough will join us from Atlanta, Georgia to speak on:

  • The new high-end DS8870 Disk System
  • XIV Gen3 overview and tour

Sandy Leadbeater will be joining us from Wellington to cover:

  • Storwize V7000 overview
  • Scale-Out NAS and V7000U overview

I will be reprising my Sydney presentations with updates:

  • Designing Scale Out NAS & Storwize V7000 Unified Solutions
  • Replication with SVC and Storwize V7000

And finally, Mike McKenzie will be joining us from Brocade in Australia to give us the skinny on IBM/Brocade FCIP Router Implementation.

FCIP Routers – A Best Practice Design Tip

Many years ago a Glaswegian friend of mine quoted someone as saying that the 1981 anti-apartheid protests in New Zealand (South African rugby tour) showed that New Zealand was not just a floating Surrey as some had previously suspected. While the Surrey reference might be lost on those not from England, I can tell you there are some distinct cultural and language differences between NZ and England.

For example, there was a (not very good) punk band called ‘Rooter’ back in the late 1970’s in New Zealand. They ended up having to change their name to The Terrorways because ‘Rooter’ was  considered too offensive by the managers of many pubs and clubs.

I guess that’s why in NZ we always pronounce ‘router’ to rhyme with ‘shouter’ even though we pronounce ‘route’ to rhyme with ‘shoot’. We’re kind of stuck in the middle between British and American English.

Pronunciation issues aside however, FCIP routers are a highly reliable way to connect fabrics and allow replication over the WAN between fibre channel disk systems. The price of FCIP routers seems to have halved over the last year or so, which is handy and live replicated DR sites have become much more commonplace in the midrange space in the last couple of years.

Apart from the WAN itself (which is the source of most replication problems) there are a couple of other things that it’s good to be aware of when assembling a design and bill of materials for FCIP routers.

  1. When you’re using the IBM SAN06B-R (Brocade 7800) we always recommend including the licence for ‘Integrated Routing’ if you’re going out over the WAN. This prevents the fabrics at either end of an FCIP link from merging. If a WAN link bounces occasionally as many do, you want to protect your fabrics from repeatedly having to work out who’s in charge and stalling traffic on the SAN while they do that. Without IR your WAN FCIP environment might not really even be supportable.
  2. Similarly I usually recommend the ‘Advanced Performance Monitoring’ feature. If you run into replication performance problems APM will tell you what the FC app is actually seeing rather than you having to make assumptions based on IP network tools.
  3. The third point is new to me and was the real trigger for this blog post (thanks to Alexis Giral for his expertise in this area) and that is if you have only one router per site (as most do) then best practice is to connect only one fabric at each site as per the diagram below.

The reason for this is that the routers and the switches all run the same FabricOS and there is a small potential for an error to be propagated across fabrics, even though Integrated Routing supposedly isolates the fabrics. This is something that Alexis tells me he has explored in detail with Brocade and they too recommend this as a point of best practice. If you already have dual-fabric connected single routers then I’m not sure the risk is high enough to warrant a reconfiguration, but if you’re starting from scratch you should not connect them all up. This would also apply if you are using Cisco MDS922i and MDS91xx for example, as all switches and routers would be running NXOS and the same potential for error propagation exists.

Easy Tier is even better than we thought!

IBM storage architects and IBM Business Partners are encouraged to use Disk Magic to model performance when recommending disk systems to meet a customer requirement. Recently v9.1 of Disk magic was released and it listed nine changes from v9. This little gem was one of them:

“The Easy Tier predefined Skew Levels have been updated based on recent measurements.”

Knowing that sometimes low-key mentions like this can actually be quite significant, I thought I’d check it out.

It turns out that v9 had three settings

  • low skew (2)
  • medium skew (3.5)
  • heavy skew (7)

While v9.1 has

  • very low (2)
  • low (3.5)
  • intermediate (7)
  • high (14)
  • very high (24)

If I take a model that I did recently for Storwize V7000 customer:

  • 40 x 450GB 10K 2.5″ drives RAID5
  • 5 x 200GB SSDs RAID5
  • plus hot spares
  • 16KB I/O size
  • 70/30 read/write ratio

The v9 predictions were:

  • 12,000 IOPS at light skew (2)
  • 13,000 IOPS at medium skew (3.5)
  • 17,000 IOPS at heavy skew (7)

I have generally used medium skew (3.5) when doing general sizing, but the help section in Disk Magic now says “In order to get a realistic prediction, we recommend using the high skew (14) option for most typical environments.  Use the intermediate skew level (7) for a more conservative sizing.”

The v9.1 predictions are now:

  • 12,000 IOPS at very low (2)
  • 13,000 IOPS at low (3.5)
  • 17,000 IOPS at intermediate (7)
  • 28,000 IOPS at high (14)
  • 52,000 IOPS at very high (24)

So what we can see from this is that the performance hasn’t changed for a given skew, but what was previously considered heavy skew is now classed as intermediate. It seems that field feedback is that I/Os are more heavily skewed towards a fairly small working set as a percentage of the total data. Easy Tier is therefore generally more effective than we had bargained on. So apparently I have been under-estimating Easy Tier by a considerable margin (the difference between 13,000 IOPS and 28,000 IOPS in this particular customer example).

The Disk Magic help also provides this graph to show how the skew relates to real life. “In this chart the intermediate skew curve (the middle one) indicates that for a fast tier capacity of 20%, Easy Tier would move 79% of the Workload (I/Os) to the fast tier.”

For more reading on Easy Tier see the following:

XIV Gen3 Sequential Performance

Big Data can take a variety of forms but what better way to get a feeling for the performance of a big data storage system than using a standard audited benchmark to measure large file processing, large query processing, and video streaming.

From the www.storageperformance.org website:

“SPC-2 consists of three distinct workloads designed to demonstrate the performance of a storage subsystem during… large-scale, sequential movement of data…

  • Large File Processing: Applications… which require simple sequential process of one or more large files such as scientific computing and large-scale financial processing.
  • Large Database Queries: Applications that involve scans or joins of large relational tables, such as those performed for data mining or business intelligence.
  • Video on Demand: Applications that provide individualized video entertainment to a community of subscribers by drawing from a digital film library.”

The Storage Performance Council also recently published its first SPC-2E benchmark result. “The SPC-2/E benchmark extension consists of the complete set of SPC-2 performance measurement and reporting plus the measurement and reporting of energy use.”

It uses the same performance test as the SPC-2 so the results can be compared. It does look as though only IBM and Oracle are publishing SPC-2 numbers these days however and the IBM DS5300 and DS5020 are the same LSI OEM boxes as the Oracle 6780 and 6180, so that doesn’t really add a lot to the mix. HP and HDS seem to have fled some time ago, and although Fujitsu and Texas Memory do publish, I have never encountered either of those systems out in the market. So the SPC-2 right now is mainly a way to compare sequential performance among IBM systems.

XIV is certainly interesting, because in its Generation 2 format it was never marketed as a box for sequential or single-threaded workloads. XIV Gen2 was a box for random workloads, and the more random and mixed the workload the better it seemed to be. With XIV Generation 3 however we have a system that is seen to be great with sequential workloads, especially Large File Processing, although not quite so strong for Video on Demand.

The distinguishing characteristic of LFP is that it is a read/write workload, while the others appear to be read-only. XIV’s strong write performance comes through on the LFP benchmark.

Drilling down one layer deeper we can look at the components that make up Large File Processing. Sub-results are reported for reads, writes, and mixed read/write, as well as for 256 KiB and 1,024 KiB I/O sizes in each category.

So what we see is that XIV is actually slightly faster than DS8800 on the write workloads, but falls off a little when the read percentage of the I/O mix is higher.

%d bloggers like this: