Comprestimator Guesstimator

Hey folks, just a quick post based on recent experience with IBM’s NAS Comprestimator utility for Storwize V7000 Unified, where it completely failed to predict an outcome that I had predicted 100% accurately using nothing but common sense. The lesson here is that you should read the NAS Comprestimator documentation very carefully before you trust it (and once you have read and understood it, you’ll realize there are some situations in which you simply cannot trust it).

We all know that Comprestimator is a sampling tool, right? It looks at your actual data and works out the compression ratio you’re likely to get… well, kind of…

Let’s look first at the latest IBM spiel at https://www-304.ibm.com/webapp/set2/sas/f/comprestimator/home.html

“The Comprestimator utility uses advanced mathematical and statistical algorithms to perform the sampling and analysis process in a very short and efficient way.”

Cool, advanced mathematical and statistical algorithms – sounds great!

But there’s a slightly different story told on an older page that is somewhat more revealing http://m.ibm.com/http/www14.software.ibm.com/webapp/set2/sas/f/comprestimator/NAS_Compression_estimation_utility.html

“The NAS Compression Estimation Utility performs a very efficient and quick listing of file directories. The utility analyzes file-type distribution information in the scanned directories, and uses a pre-defined list of expected compression rates per filename extension. After completing the directory listing step the utility generates a spreadsheet report showing estimated compression savings per each file-type scanned and the total savings expected in the environment.

It is important to understand that this utility provides a rough estimation based on typical compression rates achieved for the file-types scanned in other customer and lab environments. Since data contained in files is diverse and is different between users and applications storing the data, actual compression achieved will vary between environments. This utility provides a rough estimation of expected compression savings rather than an accurate prediction.”

The difference here is that one is for NAS and one is for block, but I’m assuming that the underlying tool is the same. So, what if you have a whole lot of files with no extension? Apparently Comprestimator then just assumes 50% compression.

Below I reveal the reverse-engineered source code for the NAS Comprestimator when it comes to assessing files with no extension, and I release it under an Apache licence. Live Free or Die, people.

#include <stdio.h>

int main(void)
{
    /* Every file with no extension gets the same hard-coded guess */
    printf("IBM advanced mathematical and statistical algorithms predict the following compression ratio: 50%%\n");
    return 0;
}

enjoy : )

 

 

Panzura – Distributed Locking & Cloud Gateway for CAD

I have been watching the multi-site distributed NAS space for some years now. There have been some interesting products, including Netapp’s Flexcache, which looked nice but never really seemed to get market traction, and similarly IBM Global Active Cloud Engine (Panache), which was released as a feature of SONAS and Storwize V7000 Unified. Microsoft has played on the edge of this field more successfully with DFS Replication, although that does not handle locking. Other technologies that encroach on this space are Microsoft Sharepoint and WAN acceleration technologies like Microsoft Branchcache and Riverbed.

What none of these has been very good at, however, is solving the problem of distributed collaborative authoring of large, complex, multi-layered documents with high performance and sturdy locking – for example, cross-referenced CAD drawings.

Panzura

It’s no surprise that the founders of Panzura came from a networking background (Aruba, Alteon) since the issues to be solved are those that are introduced by the network. Panzura is a global file system tuned for CAD files and it’s not unusual to see Panzura sites experience file load times less than one tenth or sometimes even one hundredth of what they were prior to Panzura being deployed.

Rather than just provide efficient file locking however, Panzura has taken the concept to the Cloud, so that while caching appliances can be deployed to each work site, the main data repository can be in Amazon S3 or Azure for example. Panzura now claims to be the only proven global file locking solution that solves cross-site collaboration issues of applications like Revit, AutoCAD, Civil3D, and Bentley MicroStation as well as SOLIDWORKS CAD and Siemens NX PLM applications. The problems of collaboration in these environments are well-known to CAD users.

Survey

Panzura has been growing rapidly, with 400% revenue growth in 2013 and they have just come off another record quarter and a record year for 2014. Back in 2013 they decided to focus their energies on the Architectural, Engineering & Construction (so-called AEC) markets since that was where the technology delivered the greatest return on customer investment. In that space they have been growing more than 1000% per year.

ViFX recently successfully supplied Panzura to an international engineering company based in New Zealand. If you have problems with shared CAD file locking, please contact ViFX to see how we can solve the problem using Panzura.


Storage Spaghetti Anyone?

I recall Tom West (Chief Scientist at Data General, and the star of The Soul of a New Machine) once saying to me when he visited New Zealand that there was an old saying: “Hardware lasts three years, Operating Systems last 20 years, but applications can go on forever.”

Over the years I have known many application developers and several development managers, and one thing they seem to agree on is that it is almost impossible to maintain good code structure inside an app over many years. The pressure of feature deadlines, changes in the market, in fashion and in the way people use applications, the occasional weak programmer or weak dev manager, and temporary lapses in discipline due to other pressures all contribute to fragmentation over time. It is generally by this slow attrition that apps end up full of structural compromises, with the occasional corner that is complete spaghetti.

I am sure there are exceptions, and there can be periodic rebuilds that improve things, but rebuilds are expensive.

If I think about the OS layer, I recall Data General rebuilding much of their DG/UX UNIX kernel to make it more structured because they considered the System V code to be pretty loose. Similarly IBM rebuilt UNIX into a more structured AIX kernel around the same time, and Digital UNIX (OSF/1) was also a rebuild based on Mach. Ironically HPUX eventually won out over Digital UNIX after the merger, with HPUX rumoured to be the much less structured product, a choice that I’m told has slowed a lot of ongoing development. Microsoft rebuilt Windows as NT and Apple rebuilt Mac OS to base it on the Mach kernel.

So where am I heading with this?

Well I have discussed this topic with a couple of people in recent times in relation to storage operating systems. If I line up some storage OS’s and their approximate date of original release you’ll see what I mean:

Netapp Data ONTAP 1992 22 years
EMC VNX / CLARiiON 1993 21 years
IBM DS8000 (assuming ESS code base) 1999 15 years
HP 3PAR 2002 12 years
IBM Storwize 2003 11 years
IBM XIV / Nextra 2006 8 years
Nimble Storage 2010 4 years

I’m not trying to suggest that this is a line-up in reverse order of quality, and no doubt some vendors might claim rebuilds or superb structural discipline, but knowing what I know about software development, the age of the original code is certainly a point of interest.

With the current market disruption in storage, cost pressures are bound to take their toll on development quality, and the problem is amplified if vendors try to save money by out-sourcing development to non-integrated teams in low-cost countries (e.g. build your GUI in Romania, or your iSCSI module in India).


IBM Software-defined Storage

The phrase ‘Software-defined Storage’ (SDS) has quickly become one of the most widely used marketing buzz terms in storage. It seems to have originated with Nicira’s use of the term ‘Software-defined Networking’, which was adopted by VMware when they bought Nicira in 2012 and evolved into the ‘Software-defined Data Center’, including ‘Software-defined Storage’. VMware’s VSAN technology therefore has the top-of-mind position when we are talking about SDS. I really wish they’d called it something other than VSAN though, so as to avoid the clash with the ANSI T.11 VSAN standard developed by Cisco.

I have seen IBM regularly use the term ‘Software-defined Storage’ to refer to:

  1. GPFS
  2. Storwize family (which would include FlashSystem V840)
  3. Virtual Storage Center / Tivoli Storage Productivity Center

I recently saw someone at IBM referring to FlashSystem 840 as SDS even though to my mind it is very much a hardware/firmware-defined ultra-low-latency system with a very thin layer of software so as to avoid adding latency.

Interestingly, IBM does not seem to market XIV as SDS, even though it is clearly a software solution running on commodity hardware that has been ‘applianced’ so as to maintain reliability and supportability.

Let’s take a quick look at the contenders:

1. GPFS: GPFS is a file system with a lot of storage features built in or added-on, including de-clustered RAID, policy-based file tiering, snapshots, block replication, support for NAS protocols, WAN caching, continuous data protection, single namespace clustering, HSM integration, TSM backup integration, and even a nice new GUI. GPFS is the current basis for IBM’s NAS products (SONAS and V7000U) as well as the GSS (gpfs storage server) which is currently targeted at HPC markets but I suspect is likely to re-emerge as a more broadly targeted product in 2015. I get the impression that gpfs may well be the basis of IBM’s SDS strategy going forward.

2. Storwize: The Storwize family is derived from IBM’s SAN Volume Controller technology and it has always been a software-defined product, but tightly integrated to hardware so as to control reliability and supportability. In the Storwize V7000U we see the coming together of Storwize and gpfs, and at some point IBM will need to make the call whether to stay with the DS8000-derived RAID that is in Storwize currently, or move to the gpfs-based de-clustered RAID. I’d be very surprised if gpfs hasn’t already won that long-term strategy argument.

3. Virtual Storage Center: The next contender in the great SDS shootout is IBM’s Virtual Storage Center and its sub-component Tivoli Storage Productivity Center. Within some parts of IBM, VSC is talked about as the key to SDS. VSC is edition dependent but usually includes the SAN Volume Controller / Storwize code developed by IBM Systems and Technology Group, as well as the TPC and FlashCopy Manager code developed by IBM Software Group, plus some additional TPC analytics and automation. VSC gives you a tremendous amount of functionality to manage a large complex site, but it requires real commitment to secure that value. I think of VSC and XIV as the polar opposites of IBM’s storage product line, even though some will suggest you do both: XIV drives out complexity based on a kind of 80/20 rule, while VSC is designed to let you manage and automate a complex environment.

Commodity Hardware: Many proponents of SDS will claim that it’s not really SDS unless it runs on pretty much any commodity server. GPFS and VSC qualify by this definition, but Storwize does not, unless you count the fact that SVC nodes are x3650 or x3550 servers. However, we are already seeing the rise of certified VMware VSAN-ready nodes as a way to control reliability and supportability, so perhaps we are heading for a happy medium between the two extremes of a traditional HCL menu and a fully buttoned down appliance.

Product Strategy: While IBM has been pretty clear in defining its focus markets – Cloud, Analytics, Mobile, Social, Security (the ‘CAMSS’ message that is repeatedly referred to inside IBM) – I think it has been somewhat less clear in articulating a consistent storage strategy, and I am finding that as the storage market matures, smart people increasingly want to know what the vendors’ strategies are. I say vendors plural because I see the same lack of strategic clarity when I look at EMC and HP, for example. That’s not to say the products aren’t good, or the roadmaps are wrong, but the long-term strategy is either not well defined or not clearly articulated.

It’s easier for new players and niche players of course, and VMware’s Software-defined Storage strategy, for example, is both well-defined and clearly articulated, which will inevitably make it a baseline for comparison with the strategies of the traditional storage vendors.

A/NZ STG Symposium: For the A/NZ audience, if you want to understand IBM’s SDS product strategy, the 2014 STG Tech Symposium in August is the perfect opportunity. Speakers include Sven Oehme from IBM Research who is deeply involved with gpfs development, Barry Whyte from IBM STG in Hursley who is deeply involved in Storwize development, and Dietmar Noll from IBM in Frankfurt who is deeply involved in the development of Virtual Storage Center.

Melbourne – August 19-22

Auckland – August 26-28

IBM Storwize V7000 RtC: “Freshly Squeezed” Revisited

Back in 2012, after IBM announced Real-time Compression (RtC) for Storwize disk systems, I covered the technology in a post entitled “Freshly Squeezed”. The challenge with RtC in practice turned out to be that on many workloads it just couldn’t get the CPU resources it needed, and I/O rates were very disappointing, especially in its newly-released, un-tuned state.

We quickly learned that lesson, and IBM’s Disk Magic became an essential tool to warn us about unsuitable workloads. Even in August 2013, when I was asked at the Auckland IBM STG Tech Symposium “Do you recommend RtC for general use?”, my answer was “Wait until mid 2014”.

Now that the new V7000 (I’m not sure we’re supposed to call it Gen2, but that works for me) is out, I’m hoping that time has come.

The New V7000: I was really impressed when we announced the new V7000 in May 2014 with its 504 disk drives, faster CPUs, 2 x RtC offload engines (Intel Coleto Creek comms encryption processors) per node canister, and extra cache resources (up to 64GB RAM per node canister, of which 36GB is dedicated to RtC), but having been caught out in 2012, I wanted to see what Disk Magic had to say about it before I started recommending it to people. That’s why this post has taken until now to happen – Disk Magic 9.16.0 has just been released.

Coleto Creek RtC offload engine

After a quick look at Disk Magic I almost titled this post “Bigger, Better, Juicier than before” but I felt I should restrain myself a little, and there are still a few devils in the details.

50% Extra: I have been working on the conservative assumption of getting an extra 50% nett space from RtC across an entire disk system when little is known about the data. If you have access to run IBM’s Comprestimator, however, it is best to do so and get a better picture.

Getting an extra 50% is the same as setting Capacity Magic to use 33% compression. Until now I believed that this was a very conservative position, but one thing I really don’t enjoy is setting an expectation and then being unable to deliver on it.

Easy Tier: The one major deficiency in Disk Magic 9.16.0 is that you can’t model Easy Tier and RtC in the same model. That is pretty annoying since on the new V7000 you will almost certainly want both. So unfortunately that means Disk Magic 9.16.0 is still a bit of a waste of time in testing most real-life configurations that include RtC and the real measure will have to wait until the next release due in August 2014.

What you can use 9.16.0 for, however, is validating the performance of RtC (without Easy Tier) and looking at the usage on the new offload engines. What I found was that the load on the RtC engines is still very dependent on the I/O size.

I/O Size: For general modelling I used to use 16KB as a default I/O size, since that is the kind of figure I had generally seen in mixed workload environments, but in more recent times I have gone back to using the default of 4KB, since the automatic cache modelling in Disk Magic takes a lot of notice of the I/O size when deciding how random the workload is likely to be. Using 4KB forces Disk Magic to assume that the workload is very random, and that once again builds in some headroom (all part of my under-promise+over-deliver strategy). If you use 16KB, or even 24KB as I have seen in some VMware environments, then Disk Magic will assume there are a lot of sequential I/Os, and I’m not entirely comfortable with the huge modelled performance improvement you get from that assumption. (For the same reason these days I tend to model Easy Tier using the ‘Intermediate’ setting rather than the default/recommended ‘High Skew’ setting.)

However, using a small I/O size in your Disk Magic modelling has the exact opposite effect when modelling RtC. RtC runs really well when the I/O size is small, and not so well when the I/O size is large. So my past conservative practice of modelling a small I/O size might not be so conservative when it comes to RtC.

Different Data Types: In the past I have also tended to build Disk Magic models with one server, because my testing showed that several servers and a single server gave the same result – all Disk Magic cared about was the number of I/O requests coming in over a given number of fibres. Now, however, we might need to take more careful account of data types and focus less on the overall average I/O size and more on the individual workloads, and which are suitable for RtC and which are not.

50% Busy: Just as going over 50% busy on a dual-controller system is a recipe for problems should you lose a controller for any reason (and faults are more likely to happen when the system is being pushed hard), going over 50% busy on your Coleto Creek RtC offload engines will also lead to problems if you lose a controller.

I always recommend using all four compression engines plus the extra cache on each dual-controller V7000. I’m now planning to work on the assumption that, yes, I can get 1.5:1 compression overall, but that it is more likely to come from half the capacity being uncompressed and half compressing at 2:1, and my Disk Magic models will reflect that. So I still expect to need 66% physical nett to reach a 100% target, but I’m now going to treat each model as being made up of at least two pools, one compressed and one not.
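If you want to sanity-check that arithmetic, here is a quick back-of-envelope sketch (my own illustrative numbers, not an IBM sizing formula) showing how a pool split 50/50 between uncompressed data and data compressing at 2:1 gets you to roughly 1.5:1 overall, which is the same thing as 33% compression in Capacity Magic terms:

#include <stdio.h>

int main(void)
{
    /* My own illustrative numbers, not an IBM sizing formula */
    double physical_tb   = 66.0;               /* physical nett capacity         */
    double uncompressed  = physical_tb / 2.0;  /* half the pool, no compression  */
    double compressed    = physical_tb / 2.0;  /* half the pool, 2:1 compression */

    double logical_tb    = uncompressed * 1.0 + compressed * 2.0;
    double overall_ratio = logical_tb / physical_tb;
    double savings_pct   = (1.0 - physical_tb / logical_tb) * 100.0;

    printf("%.0f TB physical presents roughly %.0f TB logical\n", physical_tb, logical_tb);
    printf("Overall ratio ~%.2f:1, i.e. ~%.0f%% compression in Capacity Magic terms\n",
           overall_ratio, savings_pct);
    return 0;
}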

Transparent Compression: RtC on the new Gen2 V7000 is a huge improvement over the Gen1 V7000. The hardware has been specifically designed to support it, and remember that it is truly transparent and doesn’t lose compression over time or require any kind of batch processing. That all goes to make it a very nice technology solution that most V7000 buyers should take advantage of.

My name is Storage and I’ll be your Server tonight…

Ever since companies like Data General moved RAID control into an external disk sub-system back in the early ’90s it has been standard received knowledge that servers and storage should be separate.

While the capital cost of storage in the server is generally lower than for an external centralised storage subsystem, having storage as part of each server creates fragmentation and higher operational management overhead. Asset life-cycle management is also a consideration – servers typically last 3 years and storage can often be sweated for 5 years since the pace of storage technology change has traditionally been slower than for servers.

When you look at some common storage systems however, what you see is that they do include servers that have been ‘applianced’ i.e. closed off to general apps, so as to ensure reliability and supportability.

  • IBM DS8000 includes two POWER/AIX servers
  • IBM SAN Volume Controller includes two IBM SystemX x3650 Intel/Linux servers
  • IBM Storwize is a custom variant of the above SVC
  • IBM Storwize V7000U includes a pair of x3650 file heads running RHEL and Tivoli Storage Manager (TSM) clients and Space Management (HSM) clients
  • IBM GSS (GPFS Storage Server) also uses a pair of x3650 servers, running RHEL

At one point the DS8000 was available with LPAR separation into two storage servers (intended to cater to a split production/non-production environment) and there was talk at the time of the possibility of other apps such as TSM being able to be loaded onto an LPAR (a feature that was never released).

Apps or features?: There are a bunch of apps that could be run on storage systems, and in fact many already are, except they are usually called ‘features’ rather than apps. The clearest examples are probably in the NAS world, where TSM and Space Management and SAMBA/CTDB and Ganesha/NFS, and maybe LTFS, for example, could all be treated as features.

I also recall Netapp once talking about a Fujitsu-only implementation of ONTAP that could be run in a VM on a blade server, and EMC has talked up the possibility of running apps on storage.

GPFS: In my last post I illustrated an example of using IBM’s GPFS to construct a server-based shared storage system. The challenge with these kinds of systems is that they put the onus on the installer/administrator to get it right, rather than the traditional storage appliance approach where the vendor pre-constructs the system.

Virtualization: Reliability and supportability are vital, but virtualization does allow the possibility that we could have ring-fenced partitions for core storage functions and still provide server capacity for a range of other data-oriented functions e.g. MapReduce, Hadoop, OpenStack Cinder & Swift, as well as apps like TSM and HSM, and maybe even things like compression, dedup, anti-virus, LTFS etc., but treated not so much as storage system features, but more as genuine apps that you can buy from 3rd parties or write yourself, just as you would with traditional apps on servers.

The question is not so much ‘can this be done’, but more, ‘is it a good thing to do’? Would it be a good thing to open up storage systems and expose the fact that these are truly software-defined systems running on servers, or does that just make support harder and add no real value (apart from providing a new fashion to follow in a fashion-driven industry)? My guess is that there is a gradual path towards a happy medium to be explored here.

Building Scale-out NAS with IBM Storwize V7000 Unified

If you need scalable NAS and what you’re primarily interested in is capacity scaling, with less emphasis on performance, and more on cost-effective entry price, then you might be interested in building a scale-out NAS from Storwize V7000 Unified systems, especially if you have some block I/O requirements to go with your NAS requirements.

There are three ways that I can think of doing this and each has its strengths. The documentation on these options is not always easy to find, so these diagrams and bullets might help to show what is possible.

One key point that is not well understood is that clustering V7000 systems to a V7000U allows SMB2 access to all of the capacity – a feature that you don’t get if you virtualize secondary systems rather than cluster them.

V7000U Cluster

V7000U Virtualization

V7000U Active Cloud

And of course systems management is made relatively simple with the Storwize GUI.


IBM FlashSystem 840 for Legacy-free Flash

Flash storage is at an interesting place and it’s worth taking the time to understand IBM’s new FlashSystem 840 and how it might be useful.

A traditional approach to flash is to treat it like a fast disk drive with a SAS interface, and assume that a faster version of traditional systems is the way of the future. This is not a bad idea, and with auto-tiering technologies this kind of approach was mastered by the big vendors some time ago; it can be seen for example in IBM’s Storwize family and DS8000, and as a cache layer in the XIV. Using auto-tiering we can perhaps expect large quantities of storage to deliver latencies around 5 milliseconds, rather than a more traditional 10 ms or higher (e.g. MS Exchange’s Jetstress test only fails when you get to 20 ms).


Some players want to use all SSDs in their disk systems, which you can do with Storwize for example, but this is again really just a variation on a fairly traditional approach and you’re generally looking at storage latencies down around one or two milliseconds. That sounds pretty good compared to 10 ms, but there are ways to do better and I suspect that SSD-based systems will not be where it’s at in five years’ time.

The IBM FlashSystem 840 is a little different: it uses flash chips, not SSDs. Its primary purpose is to be very, very low latency. We’re talking as low as 90 microseconds write, and 135 microseconds read. This is not a traditional system with a soup-to-nuts software stack. FlashSystem has a new Storwize GUI, but it is stripped back to keep it simple and to avoid anything that would impact latency.

This extreme low latency is a unique IBM proposition, since it turns out that even when other vendors use MLC flash chips instead of SSDs, by their own admission they generally still end up with latency close to 1 ms, presumably because of their controller and code-path overheads.
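To put some rough numbers on why that latency difference matters, here is a small illustration. The 50-I/O transaction is a made-up example of mine; the latencies are the figures discussed above:

#include <stdio.h>

int main(void)
{
    /* A made-up transaction that issues a chain of dependent reads */
    int dependent_reads = 50;

    double latency_ms[] = { 10.0, 5.0, 1.0, 0.135 };
    const char *label[] = { "10 ms spinning disk", "5 ms auto-tiered system",
                            "1 ms all-SSD system", "135 usec FlashSystem 840" };

    for (int i = 0; i < 4; i++) {
        printf("%-25s -> ~%6.2f ms of storage wait per transaction\n",
               label[i], dependent_reads * latency_ms[i]);
    }
    return 0;
}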

FlashSystem 840

  • 2u appliance with hot swap modules, power and cooling, controllers etc
  • Concurrent firmware upgrade and call-home support
  • Encryption is standard
  • Choice of 16G FC, 8G FC, 40G IB and 10G FCoE interfaces
  • Choice of upgradeable capacity
  Nett of 2-D RAID5    4 modules    8 modules    12 modules
  2TB modules          4 TB         12 TB        20 TB
  4TB modules          8 TB         24 TB        40 TB
  • Also a 2 TB starter option with RAID0
  • Each module has 10 flash chips and each chip has 16 planes
  • RAID5 is applied both across modules and within modules
  • Variable stripe RAID within modules is self-healing

I’m thinking that prime targets for these systems include databases and VDI, but also folks looking to future-proof their general performance. If you’re making a 5-year purchase, you might not want to buy a ‘mature’ SSD legacy-style flash solution when you could instead buy into a disk-free architecture of the future.

But, as mentioned, FlashSystem does not have a full traditional software stack, so let’s consider the options if you need some of that stuff:

  • IMHO, when it comes to replication, databases are usually best replicated using log shipping, Oracle Data Guard etc.
  • VMware volumes can be replicated with native VMware server-based tools.
  • AIX volumes can be replicated using AIX Geographic Mirroring.
  • On AIX and some other systems you can use logical volume mirroring to set up a mirror of your volumes with preferred read set to the FlashSystem 840, and writes mirrored to a V7000 or (DS8000 or XIV etc), thereby allowing full software stack functions on the volumes (on the V7000) without slowing down the reads off the FlashSystem.
  • You can also virtualize FlashSystem behind SVC or V7000
  • Consider using Tivoli Storage Manager dedup disk to disk to create a DR environment

Right now, FlashSystem 840 is mainly about screamingly low latency and high performance, with some reasonable data center class credentials, and all at a pretty good price. If you have a data warehouse, or a database that wants that kind of I/O performance, or a VDI implementation that you want to de-risk, or a general workload that you want to future-proof, then maybe you should talk to IBM about FlashSystem 840.

Meanwhile I suggest you check out these docs:

Another Storwize Global Mirror Best Practice Tip

Tip: When running production-style workloads alongside Global Mirror continuous replication secondary volumes on one Storwize system, best practice is to put the production and DR workloads into separate pools. This is especially important when the production workloads are write intensive.

Aside from write-intensive OLTP, OLAP etc, large file copies (e.g. zipping a 10GB flat file database export) can be the biggest hogs of write resource (cache and disk), especially where the backend disk is not write optimised (e.g. RAID6).

Write Cache Partitioning

Global Mirror continuous replication requires a fast clean path for writes at the target site. If it doesn’t get that it places heavy demands on the write cache at the target site. If that write cache is already heavily committed it creates back-pressure through Global Mirror through to the source system. However, if you create more than one pool on your Storwize system it will manage quality of service for the write cache on a pool by pool basis:

Pools on your system    Max % of write cache any one pool can use
1                       100%
2                       66%
3                       40%
4                       30%
5                       25%
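To turn those percentages into something tangible, here is a trivial sketch using the table above. The write cache figure is just a placeholder of mine, so substitute the size for your own system:

#include <stdio.h>

int main(void)
{
    /* Max % of write cache any one pool can use, indexed by pool count (from the table above) */
    const int max_pct[] = { 0, 100, 66, 40, 30, 25 };

    /* Placeholder figure - substitute the write cache size of your own system */
    const double write_cache_gb = 12.0;

    for (int pools = 1; pools <= 5; pools++) {
        printf("%d pool(s): any one pool is limited to %d%% (~%.1f GB) of write cache\n",
               pools, max_pct[pools], write_cache_gb * max_pct[pools] / 100.0);
    }
    return 0;
}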

RAID6 for Write Intensive Workloads?

If you are thinking of using RAID6 in your Global Mirror continuous replication target pool, you might want to consider using RAID10 instead, or perhaps RAID6 with Easy Tier (SSD assist). As an example, Disk Magic compares the following two options under a 100% write workload (16KB I/O size):
  • 10 x 4TB NL-SAS 7200RPM RAID1 (nett 18TiB)
  • 22 x 1200GB SAS 10KRPM 9+2 RAID6 (nett 19TiB)

Not only is the RAID1 option much lower cost, but it is also ~10% faster. I’m not 100% sure we want to encourage folks to use 7200RPM drives at the Global Mirror target side, but the point I’m making is that RAID6 is not really ideal in a 100% write environment. Of course using Easy Tier (SSD assist) can help enormously [added 29th April 2014] in some situations, but not really with Global Mirror targets since the copy grain size is 256KiB and Easy Tier will ignore everything over 64KiB.

Global Mirror with Change Volumes

Global Mirror continuous replication is not synchronous, but typically runs at a lag of less than 100 ms. One way to avoid resource contention issues is to use Global Mirror with Change Volumes (snapshot-based replication) which severs the back-pressure link completely, leaving your production volumes to run free : )

Removing a managed disk non-disruptively from a pool

If however you find yourself in the position of having a workload issue on your Global Mirror target volumes and you want to keep using continuous replication, Storwize allows you to non-disruptively depopulate a managed disk (RAID set) from the pool (assuming you have enough free capacity) so you can create a separate pool from that mdisk.

IBM Storwize 7.2 wins by a SANSlide

So, following my recent blog post on SANSlide WAN optimization appliances for use with Storwize replication, IBM has just announced Storwize 7.2 (available December), which includes not only native replication over IP networks (licensed as Global Mirror/Metro Mirror) but also SANSlide WAN optimization built in for free – i.e. to get the benefits of WAN optimization you no longer need to purchase Riverbed, Cisco WAAS or SANSlide appliances.

Admittedly, Global Mirror was a little behind the times in getting to a native IP implementation, but having got there, the developers obviously decided they wanted to do it in style and take the lead in this space, by offering a more optimized WAN replication experience than any of our competitors.

The industry problem with TCP/IP latency is the time it takes to acknowledge that your packets have arrived at the other end. You can’t send the next set of packets until you get that acknowledgement back. So on a high latency network you end up spending a lot of your time waiting, which means you can’t take proper advantage of the available bandwidth. Effective bandwidth usage can sometimes be reduced to only 20% of the actual bandwidth you are paying for.

Round trip latency

The first time I heard this story was actually back in the mid-90’s from a telco network engineer. His presentation was entitled something like “How latency can steal your bandwidth”.

SANSlide mitigates latency by virtualising the pipe with many connections. While one connection is waiting for its ACK, another is sending data. Using many connections, the pipe can often be filled to more than 95%.

SANSlide virtual links
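If you want to see why parallel connections make such a difference, here is a simple back-of-envelope model. A single TCP connection can move at most one window of data per round trip, so its throughput is roughly window/RTT, capped at the link rate; stacking connections multiplies that until the link is full. The window size, link speed and latency below are illustrative assumptions of mine, not SANSlide internals:

#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions only - not SANSlide internals */
    double link_mbps  = 1000.0;  /* link rate in megabits per second */
    double window_kib = 64.0;    /* TCP window per connection        */
    double rtt_ms     = 30.0;    /* round-trip latency               */

    /* One window per round trip is the most a single connection can move */
    double per_conn_mbps = (window_kib * 1024.0 * 8.0 / 1e6) / (rtt_ms / 1000.0);

    for (int conns = 1; conns <= 64; conns *= 4) {
        double total_mbps = per_conn_mbps * conns;
        if (total_mbps > link_mbps)
            total_mbps = link_mbps;
        printf("%2d connection(s): ~%6.1f Mbps (%.0f%% of the link)\n",
               conns, total_mbps, 100.0 * total_mbps / link_mbps);
    }
    return 0;
}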

If you have existing FCIP routers you don’t need to rush out and switch over to IP replication with SANSlide, especially if your latency is reasonably low, but if you do have a high latency network it would be worth discussing your options with your local IBM Storwize expert. It might depend on the sophistication of your installed FCIP routers. Brocade for example suggests that the IBM SAN06B-R is pretty good at WAN optimization. So the graph below does not necessarily apply to all FCIP routers.

SANSlide Throughput

When you next compare long distance IBM Storwize replication to our competitors’ offerings, you might want to ask them to include the cost of WAN optimization appliances to get a full apples for apples comparison, or you might want to take into account that with IBM Storwize you will probably need a lot less bandwidth to achieve the same RPO.

Even when others do include products like Riverbed appliances with their offerings, SANSlide still has the advantage of being completely data-agnostic, so it doesn’t get confused or slow down when transmitting encrypted or compressed data like most other WAN optimization appliances do.

Free embedded SANSlide is only one of the cool new things in the IBM Storwize world. The folks in Hursley have been very busy. Check out Barry Whyte’s blog entry and the IBM Storwize product page if you haven’t done so already.

SANSlide WAN Optimization Appliances

WAN optimization is not something that storage vendors traditionally put into their storage controllers. Storage replication traffic has to fend for itself out in the WAN world, and replication performance will usually suffer unless there are specific WAN optimization devices installed in the network.

For example, Netapp recommends Cisco WAAS as:

“an application acceleration and WAN optimization solution that allows storage managers to dramatically improve NetApp SnapMirror performance over the WAN.”

…because:

“…the rated throughput of high-bandwidth links cannot be fully utilized due to TCP behavior under conditions of high latency and high packet loss.”

EMC similarly endorses a range of WAN optimization products including those from Riverbed and Silver Peak.

Back in July, an IBM redpaper entitled “IBM Storwize V7000 and SANSlide Implementation” slipped quietly onto the IBM redbooks site. The redpaper tells us that:

this combination of SANSlide and the Storwize V7000 system provides a powerful solution for clients who require efficient, IP-based replication over long distances.

Bridgeworks SANSlide provides WAN optimization, delivering much higher throughput on medium to high latency IP networks. This graph is from the redpaper:

SANSlide improvement

Bridgeworks also advises that:

On the commercial front the company is expanding its presence with OEM partners and building a network of distributors and value-added partners both in its home market and around the world.

Anyone interested in replication using any of the Storwize family (including SVC) should probably check out the redpaper, even if only as a little background reading.

A Quick IBM ProtecTIER (Dedup VTL) Update

This is a very brief update designed to help clarify a few things about IBM’s ProtecTIER dedup VTL solutions. The details of the software functions I will leave to the redbooks (see links below).

What is ProtecTIER?

The dedup algorithm in ProtecTIER is HyperFactor, which detects recurring data in multiple backups. HyperFactor is unique in that it avoids the risk of data corruption due to hash collisions, a risk that is inherent in products based on hashing algorithms. HyperFactor uses a memory resident index, rather than disk-resident hash tables and one consequence of this is that ProtecTIER’s restore times are shorter than backup times, in contrast to other products where restore times are generally much longer.

The amount of space saved is mainly a function of the backup policies and retention periods, and the variance of the data between them, but in general HyperFactor can deliver slightly better dedup ratios than hash-based systems. The more full-backups retained on ProtecTIER, and the more intervening incremental backups, the more space that will be saved overall.

One of the key advantages of ProtecTIER is the ability to replicate deduped data in a many to many grid. ProtecTIER also supports SMB/CIFS and NFS access.

While Tivoli Storage Manager also includes many of the same capabilities as ProtecTIER, the latter will generally deliver higher performance dedup, by offloading the process to a dedicated system, leaving TSM or other backup software to concentrate on selecting and copying files.

For more information on the software functionality etc, please refer to these links:

 

ProtecTIER Systems

In the past IBM has offered three models of ProtecTIER systems, two of which are now withdrawn, and a new one has since appeared.

  • TS7610 (withdrawn) – entry level appliance up to 6 TB and 80 MB/sec.
  • TS7620 – new entry level system. Up to 35 TB of deduped capacity. Backup speed of 300 MB/sec was originally quoted, but with recent capacity increases I am still trying to confirm if the rated throughput has changed.
  • TS7650A (withdrawn) – the midrange appliance which was rated at up to 36 TB and 500 MB/sec. This appliance was based on a back-end IBM (LSI/Netapp) DS4700 disk system with 450GB drives in RAID5 configuration.
  • TS7650G – the enterprise gateway, which is currently rated at 9 TB per hour backup and up to 11.2 TB per hour restore. Each TS7650G has support for multiple Storwize V7000 or XIV disk systems, both of which offer non-disruptive drive firmware update capability.

Sizing

There are a couple of rules of thumb I try to use when doing an initial quick glance sizing with the TS7650G with V7000 disk.

  • Every V7000 disk will give you another 20 GB per hour of ProtecTIER backup throughput. The I/O profile for files is approx 80/20 random R/W with a 60KB block size and we generally use RAID6 for that. Metadata is generally placed on separate RAID10 drives and is more like 20/80 R/W.
  • Backup storage (traditionally on tape) can be five to ten times the production storage capacity, so assuming a 10:1 dedup ratio, you might need a dedup disk repository between half and the same size as your production disk. However, if you know you are already storing x TB of backups on tape, don’t plan on buying less than x/10 dedup capacity. The dedup ratio can sometimes be as high as 25:1 but more typically it will be closer to 10:1.
  • It’s probably not a good idea to buy a dedup system that can’t easily grow to double the initially sized capacity. Dedup capacity is notoriously hard to predict, and you may turn out to need more than you expected.

Those rules of thumb are not robust enough to be called a formal sizing, but they do give you a place to start in your thinking.
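For what it’s worth, here is a quick-glance calculator built from those rules of thumb. The inputs are hypothetical examples of mine; treat the output as a place to start thinking, not a formal ProtecTIER sizing:

#include <stdio.h>

int main(void)
{
    /* Hypothetical example inputs - a place to start thinking, not a formal sizing */
    double backup_data_on_tape_tb = 400.0;   /* existing backups currently held on tape    */
    double dedup_ratio            = 10.0;    /* typically ~10:1, sometimes as high as 25:1 */
    double throughput_needed_gbh  = 2000.0;  /* required backup throughput in GB per hour  */
    double gb_per_hour_per_disk   = 20.0;    /* the per-V7000-disk rule of thumb above     */

    double repository_tb = backup_data_on_tape_tb / dedup_ratio;
    double disks_needed  = throughput_needed_gbh / gb_per_hour_per_disk;

    printf("Repository: at least %.0f TB of dedup capacity, with room to grow to %.0f TB\n",
           repository_tb, repository_tb * 2.0);
    printf("Throughput: roughly %.0f V7000 disks behind the TS7650G\n", disks_needed);
    return 0;
}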


Submit! (to the System Storage Interoperation Center)

When using the IBM System Storage Interoperation Center (SSIC) you need to submit!

I have recently come across two situations where technology combinations appeared to be supported, but there were significant caveats that were not mentioned until the ‘submit’ button was clicked.

The following example triggers both of the caveats I have come across this week.

Choose FCoE (or FCoCEE as some refer to it) and VMware 5.1 and IBM Flex Systems, and then Cisco Nexus 5596UP. To simplify the number of options, also choose Midrange Disk and Storwize V7000 Host Attachment categories, and x440 Compute Node.

All good, I was thinking – unsupported combinations are not selectable, so the fact that I could select these surely meant I was safe…

SSIC

What most people seem to neglect to do is hit the submit button. Submit can sometimes bring up a lot more detail including caveats…

SSIC detail

Those with excellent vision might have noted this obscure comment associated with VMware 5.1…

A system call “fsync” may return error or timeout on VMWare guest OS’s and /or Vios LPAR OS’s

A self-funded chocolate fish will be awarded to the first person who can tell me what that actually means (yes I know what fsync is, but what does this caveat actually mean operationally?)

And possibly more important are the two identical comments made on the SAN/switch/networking lines

“Nexus TOR ( Top of Rack) Switch: Must be connected to Supported Cisco MDS Switches.”

i.e. Nexus is only supported with FCoE from the Flex server, if the V7000 itself is attached to a Cisco MDS at the back-end, even though I did not include MDS in the list of technologies I selected on the first page.

So the moral of the story is that you must hit the submit button on the SSIC if you want to get the full support picture.

And as a reminder that compatibility issues usually resolve themselves with time, here is a Dilbert cartoon from twenty one years ago. June 1992.

Dilbert compatibility

IBM FlashSystem: Feeding the Hogs

IBM has announced its new FlashSystem family following on from the acquisition of Texas Memory Systems (RAMSAN) late last year.

The first thing that interests me is where FlashSystem products are likely to play in 2013 and this graphic is intended to suggest some options. Over time the blue ‘candidate’ box is expected to stretch downwards.

Resource hogs

Flash Candidates

For the full IBM FlashSystem family you can check out the product page at http://www.ibm.com/storage/flash

Probably the most popular product will be the FlashSystem 820, the key characteristics of which are as follows:

Usable capacity options with RAID5

  • 10.3 TB per FlashSystem
  • 20.6 TB per FlashSystem
  • Up to 865 TB usable in a single 42u rack

Latency

  • 110 usec read latency
  • 25 usec write latency

IOPS

  • Up to 525,000 4KB random read
  • Up to 430,000 4KB 70/30 read/write
  • Up to 280,000 4KB random write

Throughput

  • up to 3.3 GB/sec FC
  • up to 5 GB/sec IB

Physical

  • 4 x 8 Gbps FC ports
  • or 4 x 40 Gbps QDR Infiniband ports
  • 300 VA
  • 1,024 BTU/hr
  • 13.3 Kg
  • 1 rack unit

High Availability including 2-Dimensional RAID

  • Module level Variable Stripe RAID
  • System level RAID5 across flash modules
  • Hot swap modules
  • eMLC (10 x the endurance of MLC)

For those who like to know how things plug together under the covers, the following three graphics take you through conceptual and physical layouts.

FlashSystem Logical

FlashSystem

2D Flash RAID

With IBM’s Variable Stripe RAID, if one die fails in a ten-chip stripe, only the failed die is bypassed, and then data is restriped across the remaining nine chips.
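As a conceptual illustration of what that restriping means for capacity (my own sketch with a made-up die size, not TMS internals):

#include <stdio.h>

int main(void)
{
    /* My own conceptual illustration, not TMS internals */
    double die_gb = 32.0;  /* hypothetical capacity per flash die */
    int    chips  = 10;    /* chips per stripe before the failure */

    double before = (chips - 1) * die_gb;  /* RAID5 stripe: 9 data + 1 parity              */
    double after  = (chips - 2) * die_gb;  /* restriped across 9 chips: 8 data + 1 parity  */

    printf("Stripe usable before failure:  %.0f GB (%d data + 1 parity)\n", before, chips - 1);
    printf("Stripe usable after restripe:  %.0f GB (%d data + 1 parity)\n", after, chips - 2);
    printf("Capacity given up: one die (%.0f GB), rather than a whole module\n", die_gb);
    return 0;
}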

Integration with IBM SAN Volume Controller (and Storwize V7000)

The IBM System Storage Interoperation Center is showing these as supported with IBM POWER and IBM System X (Intel) servers, including VMware 5.1 support.

The IBM FlashSystem is all about being fast and resilient. The system is based on FPGA and hardware logic so as to minimize latency. For those customers who want advanced software features like volume replication, snapshots (ironically called FlashCopy), thin provisioning, broader host support etc, the best way to achieve all of that is by deploying FlashSystem 820 behind a SAN Volume Controller (or Storwize V7000). This can also be used in conjunction with Easy Tier, with the SVC/V7000 automatically promoting hot blocks to the FlashSystem.

I’ll leave you with this customer quote:

“With some of the other solutions we tested, we poked and pried at them for weeks to get the performance where the vendors claimed it should be.  With the RAMSAN we literally just turned it on and that’s all the performance tuning we did.  It just worked out of the box.”

Feeding the hogs

NAS Metadata – Sizing for SONAS & Storwize V7000U

Out there in IBM land the field technical and sales people are often given a guideline of between 5% and 10% of total NAS capacity being allocated for metadata on SONAS or Storwize V7000 Unified systems. I instinctively knew that 10% was too high, but like an obedient little cog in the machine I have been dutifully deducting 5% from the estimated nett capacity that I have sized for customers – but no more!

Being able to size metadata more accurately becomes especially important when a customer wants to place the metadata on SSDs so as to speed up file creation/deletion but more particularly inode scans associated with replication or anti-virus.

[updated slightly on 130721]

The theory of gpfs metadata sizing is explained here and the really short version is that in most cases you will be OK with allowing 1 KiB per file per copy of metadata, but the worst case metadata sizing (when using extended attributes, for things like HSM) should be 16.5 KiB * (filecount+directorycount) * 2 for gpfs HA mirroring.

e.g.

  • if you have 20,000 files and directories the metadata space requirement should be no more than 16.5 * 20,000 * 2 = 660,000 KiB = 645 MiB
  • if you have 40 million files and directories the metadata space requirement should be no more than 16.5 * 40,000,000 * 2 = 1,320,000,000 KiB = 1.23 TiB

So why isn’t 5% a good assumption? Because metadata is a fixed per-file overhead, the percentage depends on the average file size, and what I am tending to see is that the average file size on a general-purpose NAS is around 5MB, rather than the default assumption of 1MB or lower.

So it’s more important to have a conservative estimate of your filecount (and directory count) than it is to know your capacity.

The corollary for me is that budget conscious customers are more likely to be able to afford to buy enough SSDs to host their metadata, because we may be talking 1% rather than 5%.
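Here is a quick way to see why the file size matters so much, using the worst-case formula from above with some assumed average file sizes (the file sizes are my illustration, not a guideline):

#include <stdio.h>

int main(void)
{
    /* Worst case from the formula above: 16.5 KiB per object, times two gpfs copies */
    double meta_kib_per_object = 16.5 * 2.0;

    /* Assumed average file sizes, for illustration only */
    double avg_file_mib[] = { 0.5, 1.0, 5.0 };

    for (int i = 0; i < 3; i++) {
        double data_kib = avg_file_mib[i] * 1024.0;
        double pct = 100.0 * meta_kib_per_object / data_kib;
        printf("Average file size %.1f MiB -> metadata is ~%.1f%% of data capacity\n",
               avg_file_mib[i], pct);
    }
    return 0;
}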

Note:  When designing SSD RAID sets for metadata, SONAS/V7000U/gpfs will want to mirror the metadata across two volumes, so ideally those volumes should be on different RAID sets.

Because of the big difference between the 16.5 * formula and the 5% to 10% guideline I’d be keen to get additional validation of the formula from other real users of Storwize V7000 Unified or SONAS (or maybe even general gpfs users). Let me know what you are seeing on your own systems out there. Thanks.

What do you get at an IBM Systems Technical Symposium?

What do you get at an IBM Systems Technical Symposium? Well for the event in Auckland, New Zealand November 13-15 I’ve tried to make the storage content as interesting as possible. If you’re interested in attending, send me an email at jkelly@nz.ibm.com and I will put you in contact with Jacell who can help you get registered. There is of course content from our server teams as well, but my focus has been on the storage content, planned as follows:

Erik Eyberg, who has just joined IBM in Houston from Texas Memory Systems following IBM’s recent acquisition of TMS, will be presenting “RAMSAN – The World’s Fastest Storage”. Where does IBM see RAMSAN fitting in and what is the future of flash? Check out RAMSAN on the web, on twitter, on facebook and on youtube.

Fresh from IBM Portugal and recently transferred to IBM Auckland we also welcome Joao Almeida who will deliver a topic that is sure to be one of the highlights, but unfortunately I can’t tell you what it is since the product hasn’t been announced yet (although if you click here you might get a clue).

Zivan Ori, head of XIV software development in Israel knows XIV at a very detailed level – possibly better than anyone, so come along and bring all your hardest questions! He will be here and presenting on:

  • XIV Performance – What you need to know
  • Looking Beyond the XIV GUI

John Sing will be flying in from IBM San Jose to demonstrate his versatility and expertise in all things to do with Business Continuance, presenting on:

  • Big Data – Get IBM’s take on where Big Data is heading and the challenges it presents and also how some of IBM’s products are designed to meet that challenge.
  • ProtecTIER Dedup VTL options, sizing and replication
  • Active/Active datacentres with SAN Volume Controller Stretched Cluster
  • Storwize V7000U/SONAS Global Active Cloud Engine multi-site file caching and replication

Andrew Martin will come in from IBM’s Hursley development labs to give you the inside details you need on three very topical areas:

  • Storwize V7000 performance
  • Storwize V7000 & SVC 6.4 Real-time Compression
  • Storwize V7000 & SVC Thin Provisioning

Senaka Meegama will be arriving from Sydney with three hot topics around VMware and FCoE:

  • Implementing SVC & Storwize V7000 in a VMware Environment
  • Implementing XIV in a VMware Environment
  • FCoE Network Design with IBM System Storage

Jacques Butcher is also coming over from Australia to provide the technical details you all crave on Tivoli storage management:

  • Tivoli FlashCopy Manager 3.2 including VMware Integration
  • TSM for Virtual Environments 6.4
  • TSM 6.4 Introduction and Update plus TSM Roadmap for 2013

Maurice McCullough will join us from Atlanta, Georgia to speak on:

  • The new high-end DS8870 Disk System
  • XIV Gen3 overview and tour

Sandy Leadbeater will be joining us from Wellington to cover:

  • Storwize V7000 overview
  • Scale-Out NAS and V7000U overview

I will be reprising my Sydney presentations with updates:

  • Designing Scale Out NAS & Storwize V7000 Unified Solutions
  • Replication with SVC and Storwize V7000

And finally, Mike McKenzie will be joining us from Brocade in Australia to give us the skinny on IBM/Brocade FCIP Router Implementation.

SSDs Poll – RAID5 or RAID10?

1920 – a famous event [code]

IBM SAN Volume Controller and Storwize V7000 Global Mirror
_____________________________________________________________

1920 was a big year with many famous events. Space does not permit me to mention them all, so please forgive me if your significant event of 1920 is left off the list:

  • In the US the passing of the 18th Amendment starts Prohibition
  • In the US the passing of the 19th Amendment gives women the vote [27 years after women in New Zealand had the same right].
  • The Covenant of the League of Nations (and the ILO) come into force, but the US decides not to sign (in part because it grants the league the right to declare war)
  • The US Senate refuses to sign the treaty of Versailles (in part because it was considered too harsh on Germany)
  • Bloody Sunday – British troops open fire on spectators and players during a football match in Dublin killing 14 Irish civilians and wounding 65.
  • Anti-capitalists bomb Wall Street, killing 38 and seriously injuring 143
  • Numerous other wars and revolutions

There is another famous 1920 event however – event code 1920 on IBM SAN Volume Controller and Storwize V7000 Global Mirror – and this event is much less well understood. A 1920 event code tells you that Global Mirror has just deliberately terminated one of the volume relationships you are replicating, in order to maintain good host application performance. It is not an error code as such; it is the result of automated, intelligent monitoring and decision making by your Global Mirror system.

I’ve been asked a couple of times why Global Mirror doesn’t automatically restart a relationship that has just terminated with a 1920 event code. Think about it. The system has just taken a considered decision to terminate the relationship, so why would it then restart it? If you don’t care about host impact then you can set GM up so that it doesn’t terminate in the first place, but don’t set it up to terminate on host impact and then blindly restart it as soon as it does what you told it to do.

1920 is a form of congestion control. Congestion can be at any point in the end-to-end solution:

  • Network bandwidth, latency, QoS
  • SVC/V7000 memory contention
  • SVC/V7000 processor contention
  • SVC/V7000 disk overloading

Before I explain how the system makes the decision to terminate, first let me summarize your options for avoiding 1920. That’s kind of back to front, but everyone wants to know how to avoid 1920 and not so many people really want to know the details of congestion control. Possible methods for avoiding 1920 are: (now includes a few updates in green and a few more added later in red)

  1. Ask your IBM storage specialist or IBM Business Partner about using Global Mirror with Change Volumes (RPO of minutes) rather than traditional Global Mirror (RPO of milliseconds). You’ll need to be at version 6.3 or later of the firmware to run this. Note that VMware SRM support should be in place for GM/CV by the end of September 2012. Note also that the size of a 15 minute cycling change volume is typically going to be less than 1% of the source volumes, so you don’t need a lot of extra space for this.
  2. Ensure that you have optimized your streams – create more consistency groups, and create an empty cg0 if you are using standalone volumes. 
  3. Increase the GMmaxhostdelay parameter from its default of 5 milliseconds. The system monitors the extra host I/O latency due to the tag-and-release processing of each batch of writes, and if this goes above GMmaxhostdelay then the system considers that an undesirable situation.
  4. Increase the GMlinktolerance parameter from its default of 300 seconds. This is the window over which GM tolerates latency exceeding GMmaxhostdelay before deciding to terminate. Although it has been suggested you should not increase this in a VMware environment.
  5. Increase your network bandwidth, your network quality, your network QoS settings, or reduce your network latency. Don’t skimp on your network. Buy the Performance Monitoring licence for your FCIP router (e.g. 2498-R06 feature code 7734 “R06 Performance Monitor”). I’m told that using that or using TPC are the two best ways to see what is happening with the traffic from a FC perspective, and that looking at traffic/load from an IP traffic monitor is not always going to give you the real story about the replication traffic.
  6. If your SVC/V7000 is constrained then add another I/O group to the system, or more disks at both ends if it is disk constrained. In particular don’t try to run Global Mirror from a busy production SAS/SSD system to a DR system with NL-SAS. You might be able to do that with GM/CV but not with traditional GM.
  7. Make sure there are no outstanding faults showing in the event log.

So now let’s move on to actually understanding the approach that SVC/V7000 takes to congestion control. First we need to understand streams. A GM partnership has 16 streams. All standalone volume relationships go into stream 0, consistency group 0 also goes into stream 0, consistency group 1 goes into stream 1, consistency group 2 goes into stream 2, and so on, wrapping around once you get beyond 15. Immediately we realize that if we are replicating a lot of standalone volumes, it might make sense to create an empty cg0 so that we spread things around a little. Also, within each stream, each batch of writes must be processed in tag sequence order, so having more streams (up to 16 anyway) reduces the potential for one write I/O to get caught in sequence behind a slower one. And each stream is sequence-tag-processed by one node, so ideally you would have consistency groups in perfect multiples of the number of SVC/V7000 nodes/canisters, to spread the processing evenly across all nodes.

OK, now let’s look at a few scenarios:

GMmaxhostdelay at 5 ms (default)
GMlinktolerance at 300 seconds (default)
  • If more than a third of the I/Os are slow and that happens repeatedly for 5 minutes, then the internal system controls will terminate the busiest relationship in that stream.
  • The default settings are looking for general slowness in host response caused by the use of GM
  • Maybe you’d be willing to change GMlinktolerance to 600 seconds (10 minutes) and tolerate more impact at peak periods?
GMmaxhostdelay at 100 ms
GMlinktolerance at 30 seconds
  •  If more than a third of the I/Os are extremely slow and that happens repeatedly for 30 seconds, then the internal system controls will terminate the busiest relationship in the stream
  • Looking for short periods of extreme slowness
  • This has been suggested as something to use (after doing your own careful testing) in a VMware environment given that VMware does not tolerate long-outstanding I/Os.

GMlinktolerance at 0 seconds

  • Set gmlinktolerance to 0 and the link will ‘never’ go down even if host I/O is badly affected. This was the default behaviour back in the very early days of SVC/V7000 replication.

At a slightly more detailed level, an approximation of how the gmlinktolerance and gmmaxhostdelay parameters are used together is as follows (a rough sketch in code follows the list):

  1. Look every 10 seconds and see if more than a third of the I/Os in any one stream were delayed by more than gmmaxhostdelay
  2. If more than a third were slow then we increase a counter by one for that stream, and if not we decrease the counter by one.
  3. If the counter gets to gmlinktolerance/10 then terminate the busiest relationship in the stream (and issue event code 1920)
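
To make that interplay concrete, here is my own rough approximation of the logic in plain C. It is for illustration only – the function names, the per-stream counter and the sampling loop are all invented stand-ins, not IBM’s actual implementation.

#include <stdio.h>

#define NUM_STREAMS    16
#define SAMPLE_SECONDS 10

static int counter[NUM_STREAMS];   /* one tolerance counter per stream */

/* Consistency group N is processed in stream N % 16; standalone volume
   relationships share stream 0 with cg0, which is why an empty cg0 helps
   spread standalone relationships away from everything else. */
static int stream_for_cg(int cg_id) { return cg_id % NUM_STREAMS; }

/* Called for each stream at every 10-second sample. slow_fraction is the
   fraction of I/Os in that stream delayed by more than gmmaxhostdelay
   during the sample. Returns 1 if the stream "tripped" (event 1920). */
static int sample_stream(int stream, double slow_fraction,
                         int gmlinktolerance_seconds)
{
    if (slow_fraction > 1.0 / 3.0)
        counter[stream]++;                 /* another bad sample           */
    else if (counter[stream] > 0)
        counter[stream]--;                 /* a good sample claws one back */

    if (gmlinktolerance_seconds == 0)
        return 0;                          /* 0 = never terminate          */

    if (counter[stream] >= gmlinktolerance_seconds / SAMPLE_SECONDS) {
        counter[stream] = 0;               /* terminate the busiest
                                              relationship, raise 1920     */
        return 1;
    }
    return 0;
}

int main(void)
{
    /* 35 consecutive bad samples against the default 300-second tolerance:
       the stream trips at sample 30 (300 / 10). */
    int stream = stream_for_cg(1);
    for (int i = 1; i <= 35; i++)
        if (sample_stream(stream, 0.5, 300))
            printf("sample %d: event 1920, busiest relationship in stream %d terminated\n",
                   i, stream);
    return 0;
}

With the defaults (gmmaxhostdelay 5 ms, gmlinktolerance 300 seconds) the counter has to reach 30 bad samples before anything is terminated, and the gmlinktolerance=0 special case never terminates at all, matching the behaviour described above.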

Hopefully this goes some way to explaining that event code 1920 is an intelligent, parameter-driven means of minimizing host performance impact, not a defect in GM. The parameters give you a lot of freedom to choose how you want to run things; you don’t have to stay with the defaults.

Solving another kind of Global Mirror problem back in 1920.

Real-time Compression Example

Thanks to Patrick Lee for this example of Thin Provisioning and Real-time Compression…

  • A 100 GB logical disk  is presented to Microsoft Windows 2008.
  • Win2K8 creates eight 10GB files on the logical disk (the files are very sparse).
  • 100GB of volume space actually consumes 215MB of thin provisioned space.
  • With compression turned on, the consumed space drops to 88 MB.
How much would you save? Get the Comprestimator tool here. You’ll need to sign in with or create an IBM ID. Comprestimator will run on:
  • Red Hat Enterprise Linux Version 5 (64-bit)
  • ESXi 5.0
  • AIX V6.1, V7.1
  • Windows 2003 Server, Windows 2008 Server (32-bit and 64-bit)

Comprestimator will sample your actual data – i.e. it provides real estimates, not just marketing promises.

Freshly Squeezed

[Note some additional info on RtC core usage added in blue 12th June 2012]

[Note also that testing shows that RtC works best with small block I/O e.g. databases, 4K, 8K, and has higher performance impact on larger I/O sizes. 13 March 2013]

Code level 6.4 has just been announced for IBM SAN Volume Controller and Storwize V7000 and among the new features is Realtime Compression (RtC) for primary data.

Comparing IBM RtC to post-process compression offerings from other vendors is a bit like comparing freshly squeezed orange juice to a box of reconstituted juice. IBM’s Realtime Compression is made fresh as you need it, but sooner or later the other vendors always rely on a big batch process. As it turns out, not only is reconstituted juice not very fresh, but neither is that box of not-from-concentrate juice. Only freshly squeezed is freshly squeezed. I found this quite interesting, so let’s digress for a moment…

What they don’t tell you about Not-from-Concentrate juice – Not-from-concentrate juice can be stored for up to a year. First they strip the juice of oxygen, so it doesn’t oxidize, but that also strips it of its flavour-providing chemicals. Juice companies hire fragrance companies to engineer flavour packs to add back to the juice to try to make it taste fresh. Flavour packs aren’t listed as ingredients on the label because they are derived from orange essential oil. The packs added to juice earmarked for the US market contain a lot of ethyl butyrate. Mexicans and Brazilians favour the decanals (aldehydes), or terpene compounds such as valencine.

You can read this and a little more about the difference between freshly squeezed and boxed juice here. http://civileats.com/2009/05/06/freshly-squeezed-the-truth-about-orange-juice-in-boxes/

IBM’s Realtime Compression is based on the Random Access Compression Engine (RACE) technology that IBM acquired a couple of years ago. The unique offering here is that RtC is designed to work with primary data, not just archival data. It is, as the name implies, completely real-time for both reads and writes. A compressed volume is just another volume type, similar to a thin provisioned volume, and new metrics are provided to monitor the performance overhead of the compression.

The system will report the volume size as seen by the host, the thin provisioned space assuming there was no compression, and the real space used nett of thin provisioning and compression savings. Also presented is a quick bubble showing savings from compression across the whole system. Space saving estimates are as per the following table:

Capacity Magic 5.7.4 now supports compression and caters for the variety of data types. Disk Magic will also be updated to take account of compression and a new redbook will be available shortly to cover it as well.

Most performance modelling I have seen on Storwize V7000 up until now shows controllers that are less than 10% busy, which is a good thing as RtC will use [up to] 3 out of 4 (Storwize V7000, SVC CF8) or 4 out of 6 (SVC CG8) CPU cores and 2GB of RAM. The GUI and other services still get to use the cores that RtC ‘owns’, but non-compressed I/O gets routed to the other cores. There has always been some hard-wiring of SVC cores, but we just haven’t talked about it before. The GUI can’t run on more than 2 out of 6 cores for example, and non-compressed I/O will never use more than 4 cores, that’s the way it’s always been, and RtC doesn’t change that.

Anyway, if you are more than 20% CPU busy on your current SVC or Storwize V7000 systems [extremely unlikely as SVC is a very low-CPU consumption architecture] the best way to deploy RtC would be to add another I/O group to your clustered system. I expect future hardware enhancements will see more cores per system. Storwize V7000 is currently a single 4 core processor per node, so there’s plenty of scope for increase there.

RtC is a licensed feature – licensed per enclosure on Storwize V7000 and per TB on SVC. In the coming weeks we will see how the pricing works out and that will determine the practical limits on the range of use cases. [Looks like it’s pretty cost-effective from what I’ve seen so far].

RACE sits below the copy services in the stack, so they all get to take advantage of the compression. RACE is integrated into the Thin Provisioning layer of the code so all of the usual Thin Provisioning capabilities like auto-expand are supported.

When you add a volume mirror you can choose to add the mirror as a compressed volume, which will be very useful for converting your existing volumes.

IBM’s patented approach to compression is quite different from the other vendors’.

Fixed Input : Variable Output – Netapp takes 32K chunks and compresses them into some number of 4K chunks with some amount of padding, but Netapp block re-writes are not compressed in real time, so the volume will grow as it’s used. Most workloads need to run compression as a post-process, and you will need to be very careful of the interactions with Snapshots because of the way Netapp stores snaps inside the original volume.

Variable Input : Fixed Output – IBM’s RtC is designed for use on primary data. It takes a large variable input stream, e.g. up to 160K in some cases (so it has a larger scope in which to look for repeated bit streams = better compression rates), and spits the compressed data out into a 32K fully allocated compressed chunk. Writing out a fixed 32K with no padding is more efficient, and a key benefit is that all re-writes continue to be compressed. This is a completely real-time solution.
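
To illustrate the variable input : fixed output idea (and only the idea – this toy has nothing to do with IBM’s actual RACE implementation), here is a sketch that keeps appending variable-sized host writes to a staging buffer and seals a fixed 32K container whenever the compressed form would no longer fit. It leans on zlib purely for convenience, and every name in it is invented.

/* Toy illustration of "variable input : fixed output" packing.
   This is NOT IBM's RACE code -- just a sketch of the idea using zlib.
   Build with:  cc race_toy.c -lz                                       */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CONTAINER_SIZE (32 * 1024)    /* fixed-size compressed output chunk */
#define STAGE_MAX      (160 * 1024)   /* max variable input per container   */

static unsigned char stage[STAGE_MAX];
static size_t stage_len = 0;
static int containers = 0;

/* Does the staging buffer still compress into one 32K container? */
static int fits(const unsigned char *buf, size_t len)
{
    unsigned char out[CONTAINER_SIZE];
    uLongf out_len = sizeof(out);
    return compress2(out, &out_len, buf, len, Z_DEFAULT_COMPRESSION) == Z_OK;
}

/* Compress whatever has been staged and emit one fixed 32K container. */
static void seal_container(void)
{
    unsigned char out[CONTAINER_SIZE];
    uLongf out_len = sizeof(out);
    if (compress2(out, &out_len, stage, stage_len, Z_DEFAULT_COMPRESSION) == Z_OK)
        printf("container %d: %zu input bytes -> %lu compressed bytes, "
               "written as one fixed %d byte chunk\n",
               ++containers, stage_len, (unsigned long)out_len, CONTAINER_SIZE);
    stage_len = 0;
}

/* Accept one variable-sized host write (the "variable input" part). */
static void toy_write(const unsigned char *data, size_t len)
{
    if (len > STAGE_MAX)
        return;                               /* toy: ignore oversized writes */
    if (stage_len + len > STAGE_MAX)
        seal_container();                     /* staging buffer is full       */
    memcpy(stage + stage_len, data, len);
    stage_len += len;
    if (!fits(stage, stage_len)) {            /* compressed form exceeds 32K: */
        stage_len -= len;                     /* back the last write out,     */
        seal_container();                     /* seal what we had,            */
        memcpy(stage, data, len);             /* and restage the new write    */
        stage_len = len;
    }
}

int main(void)
{
    unsigned char host_write[8 * 1024];
    memset(host_write, 'A', sizeof(host_write));   /* highly compressible */
    for (int i = 0; i < 100; i++)
        toy_write(host_write, sizeof(host_write));
    if (stage_len > 0)
        seal_container();                          /* flush the tail      */
    return 0;
}

In the Netapp fixed-input case the roles are reversed: the input is always a 32K chunk and the output is whatever number of 4K blocks it happens to compress into.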

Note that RtC is not yet available on Storwize V7000 Unified.


Letter from America

I’m currently in Los Gatos, California for a month learning all about the inner workings of SAN Volume Controller and Storwize V7000 copy services. I have my next storage post planned for June 4th or 5th, and once the new SVC and Storwize V7000 Copy Services Redbook is published I might also post some personal highlights from that as well.

Meanwhile I’m adjusting to life in Silicon Valley – lots of sun, lots of (polite) people, lots of cars, lots of dogs, not many adverbs (adjectives are preferred).

This morning I took a walk up to St Joseph’s Hill above Los Gatos.

And this afternoon I visited Hakone Japanese garden in Saratoga.

Hot tip for any New Zealanders or Australians travelling to the Bay area: Cost Plus World Market sells Vegemite.

Drive Rebuilds Continued…

I’ve been too busy to blog recently, but I have just paused long enough to add a significant update regarding IBM Storwize V7000 drive rebuild times to my earlier post on RAID rebuild times. Rather than start a new post I thought it best to keep it all together in the one place, so I have added it as point 7 in “Hu’s on first, Tony’s on second, I Don’t Know’s on third”

 

FCIP Routers – A Best Practice Design Tip

Many years ago a Glaswegian friend of mine quoted someone as saying that the 1981 anti-apartheid protests in New Zealand (South African rugby tour) showed that New Zealand was not just a floating Surrey as some had previously suspected. While the Surrey reference might be lost on those not from England, I can tell you there are some distinct cultural and language differences between NZ and England.

For example, there was a (not very good) punk band called ‘Rooter’ back in the late 1970’s in New Zealand. They ended up having to change their name to The Terrorways because ‘Rooter’ was  considered too offensive by the managers of many pubs and clubs.

I guess that’s why in NZ we always pronounce ‘router’ to rhyme with ‘shouter’ even though we pronounce ‘route’ to rhyme with ‘shoot’. We’re kind of stuck in the middle between British and American English.

Pronunciation issues aside however, FCIP routers are a highly reliable way to connect fabrics and allow replication over the WAN between fibre channel disk systems. The price of FCIP routers seems to have halved over the last year or so, which is handy, and live replicated DR sites have become much more commonplace in the midrange space in the last couple of years.

Apart from the WAN itself (which is the source of most replication problems) there are a couple of other things that it’s good to be aware of when assembling a design and bill of materials for FCIP routers.

  1. When you’re using the IBM SAN06B-R (Brocade 7800) we always recommend including the licence for ‘Integrated Routing’ if you’re going out over the WAN. This prevents the fabrics at either end of an FCIP link from merging. If a WAN link bounces occasionally as many do, you want to protect your fabrics from repeatedly having to work out who’s in charge and stalling traffic on the SAN while they do that. Without IR your WAN FCIP environment might not really even be supportable.
  2. Similarly I usually recommend the ‘Advanced Performance Monitoring’ feature. If you run into replication performance problems APM will tell you what the FC app is actually seeing rather than you having to make assumptions based on IP network tools.
  3. The third point is new to me and was the real trigger for this blog post (thanks to Alexis Giral for his expertise in this area): if you have only one router per site (as most do), then best practice is to connect only one fabric at each site, as per the diagram below.

The reason for this is that the routers and the switches all run the same FabricOS, and there is a small potential for an error to be propagated across fabrics, even though Integrated Routing supposedly isolates them. This is something that Alexis tells me he has explored in detail with Brocade, and they too recommend this as a point of best practice. If you already have single routers connected to dual fabrics then I’m not sure the risk is high enough to warrant a reconfiguration, but if you’re starting from scratch you should not connect them all up. This would also apply if you are using Cisco MDS 9222i and MDS 91xx switches for example, as all of the switches and routers would be running NX-OS and the same potential for error propagation exists.

Easy Tier is even better than we thought!

IBM storage architects and IBM Business Partners are encouraged to use Disk Magic to model performance when recommending disk systems to meet a customer requirement. Recently v9.1 of Disk Magic was released and it listed nine changes from v9. This little gem was one of them:

“The Easy Tier predefined Skew Levels have been updated based on recent measurements.”

Knowing that sometimes low-key mentions like this can actually be quite significant, I thought I’d check it out.

It turns out that v9 had three settings:

  • low skew (2)
  • medium skew (3.5)
  • heavy skew (7)

While v9.1 has

  • very low (2)
  • low (3.5)
  • intermediate (7)
  • high (14)
  • very high (24)

If I take a model that I did recently for a Storwize V7000 customer:

  • 40 x 450GB 10K 2.5″ drives RAID5
  • 5 x 200GB SSDs RAID5
  • plus hot spares
  • 16KB I/O size
  • 70/30 read/write ratio

The v9 predictions were:

  • 12,000 IOPS at low skew (2)
  • 13,000 IOPS at medium skew (3.5)
  • 17,000 IOPS at heavy skew (7)

I have generally used medium skew (3.5) when doing general sizing, but the help section in Disk Magic now says “In order to get a realistic prediction, we recommend using the high skew (14) option for most typical environments.  Use the intermediate skew level (7) for a more conservative sizing.”

The v9.1 predictions are now:

  • 12,000 IOPS at very low (2)
  • 13,000 IOPS at low (3.5)
  • 17,000 IOPS at intermediate (7)
  • 28,000 IOPS at high (14)
  • 52,000 IOPS at very high (24)

So what we can see from this is that the performance hasn’t changed for a given skew, but what was previously considered heavy skew is now classed as intermediate. It seems that field feedback is that I/Os are more heavily skewed towards a fairly small working set as a percentage of the total data. Easy Tier is therefore generally more effective than we had bargained on. So apparently I have been under-estimating Easy Tier by a considerable margin (the difference between 13,000 IOPS and 28,000 IOPS in this particular customer example).

The Disk Magic help also provides this graph to show how the skew relates to real life. “In this chart the intermediate skew curve (the middle one) indicates that for a fast tier capacity of 20%, Easy Tier would move 79% of the Workload (I/Os) to the fast tier.”
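
As a back-of-the-envelope illustration of why skew matters so much, here is a simple two-tier saturation model. This is my own simplification, not Disk Magic’s actual algorithm, and the per-tier IOPS capabilities below are assumed round numbers, not measured figures.

/* Back-of-the-envelope tiered IOPS estimate -- a simplification for
   illustration, not Disk Magic's actual model. The per-tier capabilities
   are assumed round numbers.                                            */
#include <stdio.h>

int main(void)
{
    double ssd_iops     = 25000.0;  /* assumed capability of the SSD tier   */
    double hdd_iops     = 4000.0;   /* assumed capability of the HDD tier   */
    double hot_fraction = 0.79;     /* share of I/Os landing on the SSD tier,
                                       the 20% capacity / intermediate-skew
                                       point quoted from the Disk Magic help */

    /* The system tops out when either tier saturates. */
    double limit_ssd = ssd_iops / hot_fraction;
    double limit_hdd = hdd_iops / (1.0 - hot_fraction);
    double blended   = limit_ssd < limit_hdd ? limit_ssd : limit_hdd;

    printf("SSD tier saturates at %.0f total IOPS\n", limit_ssd);
    printf("HDD tier saturates at %.0f total IOPS\n", limit_hdd);
    printf("blended estimate: %.0f IOPS\n", blended);
    return 0;
}

With these assumed numbers the HDD tier is the bottleneck, so the heavier the skew (the more of the workload that lands on the fast tier), the higher the blended ceiling climbs – which is essentially the effect behind the jump from 17,000 to 28,000 IOPS in the predictions above.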

For more reading on Easy Tier see the following:

Hu’s on first, Tony’s on second, I Don’t Know’s on third

This post started life earlier this year as a post on the death of RAID-5 being signaled by the arrival of 3TB drives. The point was that you can’t afford to be exposed to a second drive failure for 2 or 3 whole days, especially given the stress those drives are under during the rebuild period.

But the more I thought about RAID rebuild times the more I realized how little I actually knew about it and how little most other people know about it. I realized that what I knew was based a little too much on snippets of data, unreliable sources and too many assumptions and extrapolations. Everybody thinks they know something about disk rebuilds, but most people don’t really know much about it at all and thinking you know something is worse than knowing you don’t.

Reading back what I had written so far, it started to remind me of an old Abbott and Costello sketch.

Anyway, you’d think that the folks who know the real answers would be operational IT staff who watch rebuilds nervously to make sure their systems stay up, and maybe vendor lab staff who presumably get the time and resources to test these things, but I have found it surprisingly hard to find any systematic information.

I plan to add to this post as information comes to hand (new content in green) but let’s examine what I have been able to find so far:

1. The IBM N Series MS Exchange 2007 best practices whitepaper mentions a RAID-DP (RAID6) rebuild of a 146GB 15KRPM drive in a 14+2 array taking 90 minutes (best case).

Netapp points out that there are many variables to consider, including the setting of raid.reconstruct.perf_impact at either low, medium or high, and they warn that a single reconstruction effectively doubles the I/O occurring on the stack/loop, which becomes a problem when the baseline workload is more than 50%.

Netapp also says that rebuild times of 10-15 hours are normal for 500GB drives, and 10-30 hours for 1TB drives.

2. The IBM DS5000 Redpiece “Considerations for RAID-6 Availability and Format/Rebuild Performance on the DS5000” shows the following results for array rebuild times on 300GB drives as the arrays get bigger:

I’m not sure how we project this onto larger drive sizes without more lab data. In these two examples there was little difference between N Series 14+2 146GB and DS5000 14+2 300GB, but the common belief is that rebuild times rise proportionally with drive size. The 2008 Hitachi whitepaper “Why Growing Businesses Need RAID 6 Storage”, however, mentions a minimum of 24 hours for a rebuild of an array with just 11 x 1TB drives in it on an otherwise idle disk system.

What both IBM and Netapp seem to advise is that rebuild time is fairly flat until you get above 16 drives, although Netapp seems to be increasingly comfortable with larger RAID sets as well.

3. A 2008 post from Tony Pearson suggests that “In a typical RAID environment, say 7+P RAID-5, you might have to read 7 drives to rebuild one drive, and in the case of a 14+2 RAID-6, reading 15 drives to rebuild one drive. It turns out the performance bottleneck is the one drive to write, and today’s systems can rebuild faster Fibre Channel (FC) drives at about 50-55 MB/sec, and slower ATA disk at around 40-42 MB/sec. At these rates, a 750GB SATA rebuild would take at least 5 hours.”

Extrapolating from that would suggest that a RAID5 1TB rebuild is going to take at least 9 hours, 2TB 18 hours, and 3TB 27 hours. The Hitachi whitepaper figure seems to be a high outlier, perhaps dependent on something specific to the Hitachi USP architecture.
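
As a sanity check on those extrapolations, here is the simple back-of-the-envelope floor, assuming the rebuild is limited purely by sequential write throughput to the single target drive. Real rebuilds on busy, throttled systems will take longer, which is presumably why the figures above carry some headroom.

/* Best-case rebuild time floor: drive capacity / sustained write rate.
   A deliberate over-simplification -- it ignores host I/O contention,
   rebuild throttling and RAID overheads, all of which stretch it out.  */
#include <stdio.h>

int main(void)
{
    double capacities_gb[] = { 750, 1000, 2000, 3000 };
    double write_mb_s      = 42.0;   /* the ~40-42 MB/sec ATA figure above */

    for (int i = 0; i < 4; i++) {
        double seconds = capacities_gb[i] * 1000.0 / write_mb_s;
        printf("%4.0f GB drive: at least %.1f hours\n",
               capacities_gb[i], seconds / 3600.0);
    }
    return 0;
}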

Tony does point out that his explanation is a deliberate over-simplification for the purposes of accessibility; perhaps that’s why it doesn’t explain why there might be step increases in drive rebuild times at 8 and 16 drives.

4. The IBM DS8000 Performance Monitoring and Tuning redbook states “RAID 6 rebuild times are close to RAID 5 rebuild times (for the same size disk drive modules (DDMs)), because rebuild times are primarily limited by the achievable write throughput to the spare disk during data reconstruction.” and also “For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed time, although RAID 5 and RAID 6 require significantly more disk operations and therefore are more likely to impact other disk activity on the same disk array.”

The image below just came to hand. It shows how the new predictive rebuild feature on DS8000 can reduce rebuild times. Netapp do a similar thing, I believe. Interestingly, it shows a much higher rebuild rate than the 50MB/sec that is usually talked about.

5. The EMC whitepaper “The Effect of Priorities on LUN Management Operations” focuses on the effect of assigned priority as one would expect, but is nonetheless very useful in helping to understand generic rebuild times (although it does contain a strange assertion that SATA drives rebuild faster than 10KRPM drives, which I assume must be a transposition error). Anyway, the doc broadly reinforces the data from IBM and Netapp, including this table.

This seems to show that the increase in rebuild times is more linear as the RAID sets get bigger, compared to IBM’s data which showed steps at 8 and 16. One person with CX4 experience reported to me that you’d be lucky to get close to 30MB/sec on a RAID5 rebuild on a typical working system, and that when a vault drive is rebuilding with priority set to ASAP not much else gets done on the system at all. It remains unclear to me how much of the vendor variation I am seeing is due to reporting differences and detail levels versus architectural differences.

6. IBM SONAS 1.3 reports a rebuild time of only 9.8 hours for a 3TB drive RAID6 8+2 on an idle system, and 6.1 hours on a 2TB drive (down from 12 hours in SONAS 1.2). This change from 12 hours down to 6.1 comes simply from a code update, so I guess this highlights that not all constraints on rebuild are physical or vendor-generic.

7. March 2012: I just found this pic from the IBM Advanced Technical Skills team in the US. This gives me the clearest measure yet of rebuild times on IBM’s Storwize V7000. Immediately obvious is that the Nearline drive rebuild times stretch out a lot when the target rebuild rate is limited so as to reduce host I/O impact, but the SAS and SSD drive rebuild times are pretty impressive. The table also came with a comment estimating that 600GB SAS drives would take twice the rebuild time of the 300GB SAS drives shown.

~

In 2006 Hu Yoshida posted that “it is time to replace 20 year old RAID architectures with something that does not impact I/O as much as it does today with our larger capacity disks. This is a challenge for our developers and researchers in Hitachi.”

I haven’t seen any sign of that from Hitachi, but IBM’s XIV RAID-X system is perhaps the kind of thing he was contemplating. RAID-X achieves re-protection rates of more than 1TB of actual data per hour and there is no real reason why other disk systems couldn’t implement the scattered RAID-X approach that XIV uses to bring a large number of drives into play on data rebuilds, where protection is about making another copy of data blocks as quickly as possible, not about drive substitution.
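
A quick back-of-the-envelope comparison shows why the scattered approach is so much faster. The drive count and throughput figures below are my own illustrative assumptions, not XIV’s (or anyone else’s) actual numbers.

/* Spare-limited rebuild vs scattered re-protection -- a toy comparison.
   All figures here are illustrative assumptions, not measured numbers.  */
#include <stdio.h>

int main(void)
{
    double data_tb          = 2.0;   /* actual data on the failed drive     */
    double spare_write_mb_s = 50.0;  /* classic rebuild: one spare drive's
                                        sustained write rate is the limit   */
    int    surviving_drives = 179;   /* scattered: every remaining drive
                                        re-mirrors a small slice            */
    double per_drive_mb_s   = 10.0;  /* modest share of each drive's
                                        bandwidth given to re-protection    */

    double classic_hours   = data_tb * 1e6 / spare_write_mb_s / 3600.0;
    double scattered_hours = data_tb * 1e6 /
                             (surviving_drives * per_drive_mb_s) / 3600.0;

    printf("classic spare-limited rebuild: %.1f hours\n", classic_hours);
    printf("scattered re-protection:       %.1f hours\n", scattered_hours);
    return 0;
}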

So that’s about as much as I know about RAID rebuilds. Please feel free to send me your own rebuild experiences and measurements if you have any.

XIV Gen3 Sequential Performance

Big Data can take a variety of forms, but what better way to get a feeling for the performance of a big data storage system than using a standard audited benchmark to measure large file processing, large query processing, and video streaming?

From the www.storageperformance.org website:

“SPC-2 consists of three distinct workloads designed to demonstrate the performance of a storage subsystem during… large-scale, sequential movement of data…

  • Large File Processing: Applications… which require simple sequential process of one or more large files such as scientific computing and large-scale financial processing.
  • Large Database Queries: Applications that involve scans or joins of large relational tables, such as those performed for data mining or business intelligence.
  • Video on Demand: Applications that provide individualized video entertainment to a community of subscribers by drawing from a digital film library.”

The Storage Performance Council also recently published its first SPC-2E benchmark result. “The SPC-2/E benchmark extension consists of the complete set of SPC-2 performance measurement and reporting plus the measurement and reporting of energy use.”

It uses the same performance test as the SPC-2, so the results can be compared. It does look as though only IBM and Oracle are publishing SPC-2 numbers these days, however, and the IBM DS5300 and DS5020 are the same LSI OEM boxes as the Oracle 6780 and 6180, so that doesn’t really add a lot to the mix. HP and HDS seem to have fled some time ago, and although Fujitsu and Texas Memory do publish, I have never encountered either of those systems out in the market. So the SPC-2 right now is mainly a way to compare sequential performance among IBM systems.

XIV is certainly interesting, because in its Generation 2 format it was never marketed as a box for sequential or single-threaded workloads. XIV Gen2 was a box for random workloads, and the more random and mixed the workload the better it seemed to be. With XIV Generation 3 however we have a system that is seen to be great with sequential workloads, especially Large File Processing, although not quite so strong for Video on Demand.

The distinguishing characteristic of LFP is that it is a read/write workload, while the others appear to be read-only. XIV’s strong write performance comes through on the LFP benchmark.

Drilling down one layer deeper we can look at the components that make up Large File Processing. Sub-results are reported for reads, writes, and mixed read/write, as well as for 256 KiB and 1,024 KiB I/O sizes in each category.

So what we see is that XIV is actually slightly faster than DS8800 on the write workloads, but falls off a little when the read percentage of the I/O mix is higher.

NAS Robot Wars

The new Storwize V7000 Unified (Storwize V7000U) enhancements mean that IBM’s common NAS software stack (first seen in SONAS) for CIFS/NFS/FTP/HTTP/SCP is now deployed into the midrange.

Translating that into simpler language:

IBM is now doing its own mid-range NAS/Block Unified disk systems.

Anyone who has followed the SONAS product (and my posts on said product) will be familiar with the functions of IBM’s common NAS software stack, but the heart of the value is the file-based ILM capability, now essentially being referred to as the Active Cloud Engine.

The following defining image of the Active Cloud Engine is taken from an IBM presentation:

What the file migration capability does is place files onto a specific tier of disk depending on the user-defined policy.

e.g. when disk tier1 hits 80% full, move any files that have not been accessed for more than 40 days to tier2.
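
In code terms the policy decision boils down to something like the sketch below. The real thing is expressed in the SONAS/GPFS policy language rather than C, and every name here is invented for illustration.

/* Toy sketch of the file-placement rule above -- not the actual SONAS/GPFS
   policy language, and all names here are invented for illustration.     */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define DAY_SECONDS (24 * 60 * 60)

struct file_info {
    const char *path;        /* the file keeps its place in the tree      */
    time_t      last_access;
};

/* Should this file be migrated from tier1 to tier2 right now? */
static bool should_migrate(const struct file_info *f,
                           double tier1_utilization,    /* 0.0 - 1.0      */
                           time_t now)
{
    bool tier_full = tier1_utilization >= 0.80;                  /* 80% full */
    bool stale     = (now - f->last_access) > 40 * DAY_SECONDS;  /* 40 days  */
    return tier_full && stale;
}

int main(void)
{
    struct file_info f = { "/shared/projects/old_design.dwg",
                           time(NULL) - 60 * DAY_SECONDS };
    printf("migrate? %s\n",
           should_migrate(&f, 0.85, time(NULL)) ? "yes" : "no");
    return 0;
}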

Importantly these files keep their original place in the directory tree.

The file-based disk to disk migration is built-in, and does not require any layered products or additional licensing.

Files can also be migrated off to tape as required without losing their place in the same directory tree, using HSM which is licensed separately.

Another important feature that IBM’s competitors don’t have: although there are two file services modules in every Storwize V7000U, operating in an active/active configuration, they present a single namespace to the users, e.g. all of the storage can be presented as a single S: drive.

And the final key feature I wanted to mention was the unified management interface for file and block services, another feature which some of our competitors lack.

Naturally there are many other features of the Storwize V7000U, most of which you’ll find mentioned on the standard Storwize V7000 product page and the Storwize V7000 Unified infocenter.

Today IBM also announces SONAS 1.3, as well as a 243TB XIV model based on 3TB drives, SVC split cluster up to 300Kms, Block replication compatibility between SVC and Storwize V7000, Snapshot-based replication option for SVC and Storwize V7000 and an assortment of Tivoli software enhancements.

Check out IBM bloggers Tony Pearson, Barry Whyte and Rawley Burbridge who have more details.

Meanwhile talking about Active Cloud Engine as a kind of robot reminded me of another robot. Although I have never really been at ease with the ugly competitiveness of capitalism, I do hate losing, so perhaps this is a more apt image to show how we see the Active Cloud Engine ‘robot’ stacking up against the competition.

And here are some other Killer Robots:

The Big Bang Theory – “The Killer Robot”

Hypno-Disc Vs Pussycat

Razer Vs Onslaught

Jamie Hyneman’s (MythBuster) robot Blendo in action against DoMore

A Small Challenge with NAS Gateways

SAN Volume Controller

Late in 2010, Netapp quietly announced that they were not planning to support V Series (and by extension IBM N Series NAS Gateways) for use with any recent version of IBM’s SAN Volume Controller.

This was discussed more fully on the Netapp communities forum (you’ll need to create a login) and the reason given was insufficient sales revenue to justify on-going support.

This is to some extent generically true for all N Series NAS gateways. For example, if all you need is basic CIFS access to your disk storage, most of the spend still goes on the disk and the SVC licensing, not on the N Series gateway. This is partly a result of the way Netapp prices their systems – the package of the head units and base software (including the first protocol) is relatively cheap, while the drives and optional software features are relatively expensive.

Netapp did not, however, withdraw support for V Series NAS gateways on XIV or DS8000, nor, as best I can tell, do they have any intention to, since they consider that support to be a core capability for V Series NAS Gateways.

I also note that Netapp occasionally tries to position V Series gateways as a kind of SVC-lite, to virtualize other disk systems for block I/O access.

Anyway, it was interesting that what IBM announced was a little different to what Netapp announced: “NetApp & N Series Gateway support is available with SVC 6.2.x for selected configurations via RPQ [case-by-case lab approval] only”.

Storwize V7000

What made this all a bit trickier was IBM’s announcement of the Storwize V7000 as its new premier midrange disk system.

Soon after on the Netapp communities forum it was stated that there was a “joint decision” between Netapp and IBM that there would be no V Series NAS gateway support and no PVRs [Netapp one-off lab support] for Storwize V7000 either.

Now the Storwize V7000 disk system, which is projected to have sold close to 5,000 systems in its first 12 months, shares the same code-base and features as SVC (including the ability to virtualize other disk systems). So think about that for a moment, that’s two products and only one set of testing and interface support – that sounds like the support ROI just improved, so maybe you’d think that the original ROI objection might have faded away at this point? It appears not.

Anyway, once again, what IBM announced was a little different to the Netapp statement: “NetApp & N Series Gateway support is available with IBM Storwize V7000 6.2.x for selected configurations via RPQ only”.

Whither from here?

The good news is that IBM’s SONAS gateways support XIV and SVC (and other storage behind SVC), and SONAS delivers some great features that N Series doesn’t have (such as file-based ILM to disk or tape tiers), so SVC is pretty well catered for when it comes to NAS gateway functionality.

When it comes to Storwize V7000 the solution is a bit trickier. SONAS is a scale-out system designed to cater for hundreds of TBs up to 14 PBs. That’s not an ideal fit for the midrange Storwize V7000 market. So the Netapp gateway/V Series announcement has created potential difficulties for IBM’s midrange NAS gateway portfolio… hence the title of this blog post.

World’s most affordable high-function 500TB+ block I/O disk solution

Gotta love this price-optimized solution for two tier disk… (plus Easy Tier automatic SSD read/write tiering).

This is possibly the most affordable high-function 500TB+ disk solution on the planet… and it all fits into only 32u of rack space!

Yeah, I know it’s a completely arbitrary solution, but it does show what’s possible when you combine Storwize V7000’s external virtualization capability with DCS3700’s super-high-density packaging, which perfectly exploits the “per tray” licensing for both Storwize V7000 and TPC for Disk MRE. The Storwize V7000 also provides easy in-flight volume migration between tiers, not to mention volume striping, thin provisioning, QoS, snapshots, clones, easy volume migration off legacy disk systems, 8Gbps FC and 10Gbps iSCSI.


Check out the component technologies:

Storwize V7000 at www.ibm.com/storage/disk/storwize_v7000 

DCS3700 at www.ibm.com/storage/disk/dcs3700

and TPC for Disk at  www.ibm.com/storage/software/center/disk/me

Nearline-SAS: Who Dares Wins

Maybe you think NL-SAS is old news and it’s already swept SATA aside?

Well if you check out the specs on FAS, Isilon, 3PAR, or VMAX, or even the monolithic VSP, you will see that they all list SATA drives, not NL-SAS on their spec sheets.

Of the serious contenders, it seems that only VNX, Ibrix, IBM SONAS, IBM XIV Gen3 and IBM Storwize V7000 have made the move to NL-SAS so far.

First we had PATA (Parallel ATA) and then SATA drives, and then for a while we had FATA drives (Fibre Channel attached ATA) or what EMC at one point confusingly  marketed as “low-cost Fibre Channel”. These were ATA drive mechanics, with SCSI command sets handled by a FC front-end on the drive.

Now we have drives that are being referred to as Capacity-Optimized SAS, or Nearline SAS (NL-SAS) both of which terms once again have the potential to be confusing. NL-SAS is a similar concept to FATA – mechanically an ATA drive (head, media, rotational speed) – but with a SAS interface (rather than a FC bridge) to handle the SCSI command set.

When SCSI made the jump from parallel to serial the designers took the opportunity to build in compatibility with SATA via a SATA tunneling protocol, so SAS controllers can support both SAS and SATA drives.

The reason we use ATA drive mechanics is that they have higher capacity and a lower price. So what are some of the advantages of using NL-SAS drives, over using traditional SATA drives?

  1. SCSI offers more sophisticated command queuing (which leads directly to reduced head movement) although ATA command queuing enhancements have closed the gap considerably in recent years.
  2. SCSI also offers better error handling and reporting.
  3. One of the things I learned the hard way when working with Engenio disk systems is that bridge technology to go from FC to SATA can introduce latency, and as it turns out, so does the translation required from a SAS controller to a SATA drive. Doing SCSI directly to a NL-SAS drive reduces controller latency, reduces load on the controller and also simplifies debugging.
  4. Overall performance can be anything from slightly better to more than double, depending on the workload.

And with only a small price premium over traditional SATA, it seems pretty clear to me that NL-SAS will soon come to dominate and SATA will be phased out over time.

NL-SAS drives also offer the option of T10 PI (SCSI Protection Information), which adds an 8-byte data integrity field to each 512-byte disk block. The 8 bytes are split into three chunks allowing for a cyclic redundancy check, application tagging (e.g. RAID information), and reference tagging to make sure the data blocks arrive in the right order. I expect 2012 to be a big year for PI deployment.
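
For the curious, the 8 extra bytes break down as shown below. The field widths are per the T10 DIF definition as I understand it; the struct itself is just an illustration (on the media the fields are big-endian).

/* Illustrative layout of a T10 PI (DIF) protected sector: 512 bytes of
   data followed by an 8-byte protection information field. The struct is
   for illustration only; on the media the fields are big-endian.        */
#include <stdint.h>
#include <stdio.h>

struct t10_pi_tuple {
    uint16_t guard_tag;     /* CRC of the 512 data bytes                  */
    uint16_t app_tag;       /* application tag, e.g. RAID information     */
    uint32_t ref_tag;       /* reference tag, typically the low 32 bits of
                               the LBA, to catch misplaced blocks         */
};

struct protected_sector {
    uint8_t             data[512];  /* the usual sector payload           */
    struct t10_pi_tuple pi;         /* the extra 8 bytes (512 + 8 = 520)  */
};

int main(void)
{
    printf("protected sector size: %zu bytes\n",
           sizeof(struct protected_sector));
    return 0;
}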

I’m assured that the photograph below is of a SAS engineer – maybe he’s testing the effectiveness of the PI extensions on the disk drive in his pocket?

Storwize V7000 four-fold Scalability takes on VMAX & 3PAR

IBM recently announced that two Storwize V7000 systems could be clustered, in pretty much exactly the same way that two iogroups can be clustered in a SAN Volume Controller environment. Clustering two Storwize V7000s creates a system with up to 480 drives and any of the paired controllers can access any of the storage pools. Barry Whyte went one step further and said that if you apply for an RPQ you can cluster up to four Storwize V7000s (up to 960 drives). Continue reading

Am I boring you? Full stripe writes and other complexity…

In 1978 IBM employee Norman Ken Ouchi was awarded patent 4092732 for a “System for recovering data stored in failed memory unit” – technology that would later be known as RAID 5 with full stripe writes.

Hands up who’s still doing that or its RAID6 derivative 33 years later?

I have a particular distaste for technologies that need to be manually tuned. Continue reading

You can’t always get what you want

There have been a raft of new storage efficiency elements brought to market in the last few years, but what has become obvious is that you can’t yet get it all in one product. Continue reading

Maximum Fibre Channel Distances

Just a quick hit and run blog post for today… This table authored by Karl Hohenauer just came into my inbox. With the changes in cable quality (OM3, OM4) the supported fibre channel distances have confused a few people, so this will be a good reference doc to remember. Continue reading

Where Should I Shove This Solid State Drive?

Everyone agrees that enterprise-class SSDs from companies like STEC Inc are fast, and cool, and pretty nice. Most people also realise that SSDs are an order of magnitude more expensive than SAS drives, and that there is no expectation that this will change dramatically within the next 5 years. This means we have to figure out how to leverage SSDs without buying a whole lot of them. Continue reading

Storwize V7000 Vs the Rest – a Quick SPC-1 Performance Roundup

This post is in response to the discussion around my recent Easy Tier performance post. Continue reading

Storwize V7000 Easy Tier: SATA RAID10 Vs SAS RAID6

When IBM released its SPC-1 Easy Tier benchmark on DS8000 earlier this year, it was done with SATA RAID10 and SSD RAID10, so when we announced Storwize V7000 with Easy Tier for the midrange, the natural assumption was to pair SATA RAID10 and SSD RAID10 again. But it seems to me that 600GB SAS RAID6 + SSD might be a better combination than 2TB SATA RAID10 + SSD. Continue reading

Exploiting the Intelligence of Inventors

In Tracy Kidder’s book “The Soul of a New Machine” I recall Data General’s Tom West as saying that the design that the team at Data General came up with for the MV/8000 minicomputer was so complex that he was worried. He had a friend who had just purchased a first-run Digital Equipment Corp VAX, and Tom went to visit him and picked through the VAX main boards counting and recording the IDs of all of the components used. He then realised that his design wasn’t so complex after all compared to the VAX, and so Tom proceeded to build the MV/8000 with confidence.

In this example, deconstruction of one product helped Tom to understand another product, and sanity check that he wasn’t making things too complicated. It didn’t tell him if MV/8000 would be better than VAX however.

I have many times seen buyers approach a storage solution evaluation using a deconstructionist approach. Once a solution is broken down into its isolated elements, it can be compared at a component level to another very different solution. It’s a pointless exercise in most cases. Continue reading

Quality of Service on SAN Volume Controller & Storwize V7000

I learned something new recently. SVC has QoS, and has had it for quite some time (maybe since day 1?). Continue reading

IBM’s New Midrange with Easy Tier & External Virtualization

Yes, IBM has announced a new midrange virtualized disk system, the Storwize V7000. A veritable CLARiiON-killer : ) Continue reading

Choice or Clutter?

Vendors often struggle to be strong in all market segments and address the broad range of customer requirements with a limited range of products. Products that fit well into one segment don’t always translate well to others, especially when trying to bridge both midrange and enterprise requirements. Continue reading
