IBM FlashSystem: Feeding the Hogs

IBM has announced its new FlashSystem family following on from the acquisition of Texas Memory Systems (RAMSAN) late last year.

The first thing that interests me is where FlashSystem products are likely to play in 2013 and this graphic is intended to suggest some options. Over time the blue ‘candidate’ box is expected to stretch downwards.

Resource hogs

Flash Candidates2

For the full IBM FlashSystem family you can check out the product page at http://www.ibm.com/storage/flash

Probably the most popular product will be the FlashSystem 820, the key characteristics of which are as follows:

Usable capacity options with RAID5

  • 10.3 TB per FlashSystem
  • 20.6 TB per FlashSystem
  • Up to 865 TB usable in a single 42u rack

Latency

  • 110 usec read latency
  • 25 usec write latency

IOPS

  • Up to 525,000 4KB random read
  • Up to 430,000 4KB 70/30 read/write
  • Up to 280,000 4KB random write

Throughput

  • up to 3.3 GB/sec FC
  • up to 5 GB/sec IB
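As a quick sanity check on the figures above, here is a small Python sketch. It assumes a rack filled with 42 of the 1U systems at the larger capacity option, and it assumes that "4KB" means 4,096 bytes; neither assumption comes from the IBM spec sheet.

```python
# Back-of-envelope checks on the FlashSystem 820 figures quoted above.
RACK_UNITS = 42                # assumption: one 1U system in every rack unit
USABLE_TB_PER_SYSTEM = 20.6    # larger RAID5 capacity option

rack_usable_tb = RACK_UNITS * USABLE_TB_PER_SYSTEM


def iops_to_gb_per_sec(iops: int, block_bytes: int = 4096) -> float:
    """Convert an IOPS figure at a given block size into decimal GB/sec."""
    return iops * block_bytes / 1e9


print(f"Usable capacity per rack: {rack_usable_tb:.1f} TB")
for label, iops in [
    ("random read", 525_000),
    ("70/30 read/write", 430_000),
    ("random write", 280_000),
]:
    print(f"4KB {label}: {iops_to_gb_per_sec(iops):.2f} GB/sec")
```

On those assumptions the rack figure comes out at about 865 TB, and even the peak 4KB random read rate moves a little over 2 GB/sec, which sits comfortably inside the quoted FC throughput.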

Physical

  • 4 x 8 Gbps FC ports
  • or 4 x 40 Gbps QDR Infiniband ports
  • 300 VA
  • 1,024 BTU/hr
  • 13.3 kg
  • 1 rack unit

High Availability including 2-Dimensional RAID

  • Module level Variable Stripe RAID
  • System level RAID5 across flash modules
  • Hot swap modules
  • eMLC (10 x the endurance of MLC)

For those who like to know how things plug together under the covers, the following three graphics take you through conceptual and physical layouts.

FlashSystem Logical

FlashSystem

2D Flash RAID

With IBM’s Variable Stripe RAID, if one die fails in a ten-chip stripe, only the failed die is bypassed, and then data is restriped across the remaining nine chips.

Integration with IBM SAN Volume Controller (and Storwize V7000)

The IBM System Storage Interoperation Center is showing these as supported with IBM POWER and IBM System X (Intel) servers, including VMware 5.1 support.

The IBM FlashSystem is all about being fast and resilient. The system is based on FPGA and hardware logic so as to minimize latency. For those customers who want advanced software features like volume replication, snapshots (ironically called FlashCopy), thin provisioning, broader host support etc, the best way to achieve all of that is by deploying FlashSystem 820 behind a SAN Volume Controller (or Storwize V7000). This can also be used in conjunction with Easy Tier, with the SVC/V7000 automatically promoting hot blocks to the FlashSystem.

I’ll leave you with this customer quote:

“With some of the other solutions we tested, we poked and pried at them for weeks to get the performance where the vendors claimed it should be.  With the RAMSAN we literally just turned it on and that’s all the performance tuning we did.  It just worked out of the box.”

Feeding the hogs

XIV 11.2 Quick Update: The Best Just Became Awesome…

Not only is XIV Gen3 proving now to be just about the most robust thing you could ever wish to own, with significant improvements over Gen2, but IBM has just announced some interesting additional enhancements to Gen3, both new hardware and new version 11.2 firmware.

  • A major improvement in performance through improved SSD caching algorithms (including storing checksums in RAM rather than on SSD)
  • Refreshed 6-core Intel E5645 CPUs (15 x 6 = 90 physical cores) and optimisation for hyper-threading (180 logical cores), including some processor affinity optimisation for iSCSI.
  • Up to twelve 10G iSCSI ports and 9K jumbo MTU support with tested performance up to 13.7GB/sec sequential read
  • A lot of work has been done on the light-weight IP stack, using Infiniband techniques for DMA so as to remove locking and CPU overhead. This driver runs in user space with very low CPU overhead and can drive iSCSI at full line rate (12 x 10Gbps).
  • The work on iSCSI also has benefits for IP replication, with multiple sessions being used to improve robustness and improve performance, as well as enhancements to concurrent code load.
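To put the 13.7GB/sec figure in context, this short Python sketch compares it with the raw line rate of twelve 10Gbps ports. It ignores Ethernet, IP and iSCSI protocol overhead, so treat the ratio as a rough indication only.

```python
# Compare the tested sequential read rate with the raw iSCSI line rate.
PORTS = 12
GBITS_PER_PORT = 10
TESTED_GBYTES_PER_SEC = 13.7

# 8 bits per byte; protocol overhead is deliberately ignored here.
line_rate_gbytes_per_sec = PORTS * GBITS_PER_PORT / 8
utilisation = TESTED_GBYTES_PER_SEC / line_rate_gbytes_per_sec

print(f"Raw line rate: {line_rate_gbytes_per_sec:.1f} GB/sec")
print(f"Tested rate as a share of line rate: {utilisation:.0%}")
```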

10G

Some of the other cool things in 11.2 include:

  • The rebuild time for 3TB of data (a 3TB drive 100% full) used to be 76 minutes, which was industry leading. With version 11.2 of the firmware that time has been halved to just 38 minutes, and the rebuild time is virtually unaffected by system user load!
  • Space reclamation enhancements.
  • More efficient power supplies.
  • An export to CSV option is now available on every information table in the system.
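The rebuild times quoted above can be turned into an implied aggregate rebuild rate. This is only a rough Python sketch: it assumes a decimal 3TB (3 x 10^12 bytes) and that the whole 3TB is re-protected during the rebuild window.

```python
# Implied aggregate rebuild rate for a full 3TB drive.
def rebuild_rate_mb_per_sec(data_tb: float, minutes: float) -> float:
    """Average rate in decimal MB/sec needed to re-protect data_tb in the given time."""
    return data_tb * 1e12 / (minutes * 60) / 1e6


before = rebuild_rate_mb_per_sec(3, 76)   # prior firmware
after = rebuild_rate_mb_per_sec(3, 38)    # version 11.2

print(f"76 minutes: {before:.0f} MB/sec")
print(f"38 minutes: {after:.0f} MB/sec")
```

That works out to roughly 658 MB/sec before and roughly 1,316 MB/sec with 11.2, which is far more than any single drive could absorb and is a reminder that the rebuild is spread across the whole system.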

XIV export

So in summary you could say the big points are:

  • Availability is now best in industry
  • Real-world IOPS performance is well into six figures with single digit latency, and it just keeps getting better
  • iSCSI has been made awesome/enterprise-class – quite unlike some other iSCSI implementations around
  • The rebuild time for 3TB of data is so far beyond what the opposition can do that it looks like sorcery

If you haven’t thought about XIV for a while, it’s time you took another look.

Storage Complexity…

This week I’m on a summer camping holiday, so why not head over to Storagebod’s blog and read what The Bod has to say on the critical topic of storage complexity…

NAS Metadata – Sizing for SONAS & Storwize V7000U

Out there in IBM land the field technical and sales people are often given a guideline of between 5% and 10% of total NAS capacity being allocated for metadata on SONAS or Storwize V7000 Unified systems. I instinctively knew that 10% was too high, but like an obedient little cog in the machine I have been dutifully deducting 5% from the estimated nett capacity that I have sized for customers – but no more!

Being able to size metadata more accurately becomes especially important when a customer wants to place the metadata on SSDs so as to speed up file creation/deletion but more particularly inode scans associated with replication or anti-virus.

The theory of gpfs metadata sizing is explained here and the really short version is that the worst case metadata sizing should be 16.5 KiB * (filecount+directorycount) * 2 for gpfs HA mirroring.

e.g.

  • if you have 20,000 files and directories the metadata space requirement should be no more than 16.5 * 20,000 * 2 = 660,000 KiB = 645 MiB
  • if you have 40 million files and directories the metadata space requirement should be no more than 16.5 * 40,000,000 * 2 = 1,320,000,000 KiB = 1.23 TiB
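The worst-case formula is easy to wrap in a few lines of Python. This is just a sketch of the arithmetic above; the function names are my own and are not part of any gpfs or SONAS tooling.

```python
# Worst-case metadata sizing: 16.5 KiB per file or directory, doubled for mirroring.
KIB_PER_OBJECT = 16.5
MIRROR_COPIES = 2


def worst_case_metadata_kib(object_count: int) -> float:
    """Worst-case metadata space in KiB for a count of files plus directories."""
    return KIB_PER_OBJECT * object_count * MIRROR_COPIES


def format_kib(kib: float) -> str:
    """Render a KiB value in the largest sensible binary unit (up to TiB)."""
    value = kib
    for unit in ("KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:.2f} {unit}"
        value /= 1024


for count in (20_000, 40_000_000):
    print(f"{count:>12,} objects: {format_kib(worst_case_metadata_kib(count))}")
```

Feeding in your own conservative file and directory count gives an upper bound for the metadata space you would need to set aside.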

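So why isn’t 5% a good assumption? What I am tending to see is that average file size on a general purpose NAS is around 5MB rather than the default assumption of 1MB or lower. At 5MB per file, the worst-case 33 KiB of mirrored metadata per file is under 1% of the data it describes, whereas at 1MB per file it is a little over 3%.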

So it’s more important to have a conservative estimate of your filecount (and directory count) than it is to know your capacity.

The corollary for me is that budget conscious customers are more likely to be able to afford to buy enough SSDs to host their metadata, because we may be talking 1% rather than 5%, but if you do end up with extra SSD space then you can always use that for Easy Tier.
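To see how strongly the metadata percentage depends on average file size, here is a rough Python sketch. It treats "MB" as MiB for simplicity and ignores directories, both of which are my simplifications rather than anything in the gpfs sizing guidance.

```python
# Worst-case mirrored metadata as a fraction of data capacity.
METADATA_KIB_PER_FILE = 16.5 * 2   # 16.5 KiB per file, mirrored


def metadata_fraction(avg_file_size_mib: float) -> float:
    """Worst-case metadata space divided by data space for a given average file size."""
    return METADATA_KIB_PER_FILE / (avg_file_size_mib * 1024)


for size_mib in (0.5, 1, 5, 20):
    print(f"average file size {size_mib:>4} MiB: {metadata_fraction(size_mib):.2%} metadata")
```

With these assumptions an average file size of 5 MiB lands well below 1%, while you need averages of about half a MiB before the worst case climbs past the 5% guideline.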

Note: When designing SSD RAID sets for metadata, SONAS/V7000U/gpfs will want to mirror the metadata across two volumes, so ideally those volumes should be on different RAID sets.

Because of the big difference between the 16.5 * formula and the 5% to 10% guideline I’d be keen to get additional validation of the formula from other real users of Storwize V7000 Unified or SONAS (or maybe even general gpfs users). Let me know what you are seeing on your own systems out there. Thanks.

What do you get at an IBM Systems Technical Symposium?

What do you get at an IBM Systems Technical Symposium? Well for the event in Auckland, New Zealand November 13-15 I’ve tried to make the storage content as interesting as possible. If you’re interested in attending, send me an email at jkelly@nz.ibm.com and I will put you in contact with Jacell who can help you get registered. There is of course content from our server teams as well, but my focus has been on the storage content, planned as follows:

Erik Eyberg, who has just joined IBM in Houston from Texas Memory Systems following IBM’s recent acquisition of TMS, will be presenting “RAMSAN – The World’s Fastest Storage”. Where does IBM see RAMSAN fitting in and what is the future of flash? Check out RAMSAN on the web, on twitter, on facebook and on youtube.

Fresh from IBM Portugal and recently transferred to IBM Auckland we also welcome Joao Almeida who will deliver a topic that is sure to be one of the highlights, but unfortunately I can’t tell you what it is since the product hasn’t been announced yet (although if you click here you might get a clue).

Zivan Ori, head of XIV software development in Israel knows XIV at a very detailed level – possibly better than anyone, so come along and bring all your hardest questions! He will be here and presenting on:

  • XIV Performance – What you need to know
  • Looking Beyond the XIV GUI

John Sing will be flying in from IBM San Jose to demonstrate his versatility and expertise in all things to do with Business Continuance, presenting on:

  • Big Data – Get IBM’s take on where Big Data is heading and the challenges it presents and also how some of IBM’s products are designed to meet that challenge.
  • ProtecTIER Dedup VTL options, sizing and replication
  • Active/Active datacentres with SAN Volume Controller Stretched Cluster
  • Storwize V7000U/SONAS Global Active Cloud Engine multi-site file caching and replication

Andrew Martin will come in from IBM’s Hursley development labs to give you the inside details you need on three very topical areas:

  • Storwize V7000 performance
  • Storwize V7000 & SVC 6.4 Real-time Compression
  • Storwize V7000 & SVC Thin Provisioning

Senaka Meegama will be arriving from Sydney with three hot topics around VMware and FCoE:

  • Implementing SVC & Storwize V7000 in a VMware Environment
  • Implementing XIV in a VMware Environment
  • FCoE Network Design with IBM System Storage

Jacques Butcher is also coming over from Australia to provide the technical details you all crave on Tivoli storage management:

  • Tivoli FlashCopy Manager 3.2 including VMware Integration
  • TSM for Virtual Environments 6.4
  • TSM 6.4 Introduction and Update plus TSM Roadmap for 2013

Maurice McCullough will join us from Atlanta, Georgia to speak on:

  • The new high-end DS8870 Disk System
  • XIV Gen3 overview and tour

Sandy Leadbeater will be joining us from Wellington to cover:

  • Storwize V7000 overview
  • Scale-Out NAS and V7000U overview

I will be reprising my Sydney presentations with updates:

  • Designing Scale Out NAS & Storwize V7000 Unified Solutions
  • Replication with SVC and Storwize V7000

And finally, Mike McKenzie will be joining us from Brocade in Australia to give us the skinny on IBM/Brocade FCIP Router Implementation.

SSDs Poll – RAID5 or RAID10?

1920 – a famous event [code]

IBM SAN Volume Controller and Storwize V7000 Global Mirror
_____________________________________________________________

1920 was a big year with many famous events. Space does not permit me to mention them all, so please forgive me if your significant event of 1920 is left off the list:

  • In the US the passing of the 18th Amendment starts prohibition
  • In the US the passing of the 19th Amendment gives women the vote [27 years after women in New Zealand had the same right].
  • The Covenant of the League of Nations (and the ILO) come into force, but the US decides not to sign (in part because it grants the League the right to declare war)
  • The US Senate refuses to ratify the Treaty of Versailles (in part because it was considered too harsh on Germany)
  • Bloody Sunday – British troops open fire on spectators and players during a football match in Dublin killing 14 Irish civilians and wounding 65.
  • Anti-capitalists bomb Wall Street, killing 38 and seriously injuring 143
  • Numerous other wars and revolutions

There is another famous 1920 event however – event code 1920 on IBM SAN Volume Controller and Storwize V7000 Global Mirror, and this event is much less well understood. A 1920 event code tells you that Global Mirror has just deliberately terminated one of the volume relationships you are replicating, in order to maintain good host application performance. It is not an error code as such; it is the result of automated intelligent monitoring and decision making by your Global Mirror system.

I’ve been asked a couple of times why Global Mirror doesn’t automatically restart a relationship that has just terminated with a 1920 event code. Think about it. The system has just taken a considered decision to terminate the relationship, so why would it then restart it? If you don’t care about host impact then you can set GM up so that it doesn’t terminate in the first place, but don’t set it up to terminate on host impact and then blindly restart it as soon as it does what you told it to do.

1920 is a form of congestion control. Congestion can be at any point in the end-to-end solution:

  • Network bandwidth, latency, QoS
  • SVC/V7000 memory contention
  • SVC/V7000 processor contention
  • SVC/V7000 disk overloading

Before I explain how the system makes the decision to terminate, let me first summarize your options for avoiding 1920. That’s kind of back to front, but everyone wants to know how to avoid 1920 and not so many people really want to know the details of congestion control. Possible methods for avoiding 1920 are:

  1. Ask your IBM storage specialist or IBM Business Partner about using Global Mirror with Change Volumes (RPO of minutes) rather than traditional Global Mirror (RPO of milliseconds). You’ll need to be at version 6.3 or later of the firmware to run this. Note that VMware SRM support should be in place for GM/CV by the end of September 2012. Note also that the size of a 15 minute cycling change volume is typically going to be less than 1% of the source volumes, so you don’t need a lot of extra space for this.
  2. Ensure that you have optimized your streams – create more consistency groups, and create an empty cg0 if you are using standalone volumes. 
  3. Increase the GMmaxhostdelay parameter from its default of 5 milliseconds. The system monitors the extra host I/O latency due to the tag-and-release processing of each batch of writes, and if this goes above GMmaxhostdelay then the system considers that an undesirable situation.
  4. Increase the GMlinktolerance parameter from its default of 300 seconds. This is the window over which GM tolerates latency exceeding GMmaxhostdelay before deciding to terminate. It has been suggested, however, that you should not increase this in a VMware environment.
  5. Increase your network bandwidth, your network quality, your network QoS settings or reduce your network latency. Don’t skimp on your network. Buy the licence for performance monitoring on your FCIP router (e.g. 2498-R06 feature code 7734 “R06 Performance Monitor”). I’m told that using that or using TPC are the two best ways to see what is happening with traffic from a FC perspective, and that looking at traffic/load from an IP traffic monitor is not always going to give you the real story about the replication traffic.
  6. If your SVC/V7000 is constrained then add another I/O group to the system, or more disks at both ends if it is disk constrained. In particular don’t try to run Global Mirror from a busy production SAS/SSD system to a DR system with NL-SAS. You might be able to do that with GM/CV but not with traditional GM.
  7. Make sure there are no outstanding faults showing in the event log.

So now let’s move on to actually understanding the approach that SVC/V7000 takes to congestion control. First we need to understand streams. A GM partnership has 16 streams. All standalone volume relationships go into stream 0, consistency group 0 also goes into stream 0, consistency group 1 goes into stream 1, consistency group 2 goes into stream 2, etc, wrapping around as you get beyond 16. Immediately we realize that if we are replicating a lot of standalone volumes it might make sense to create an empty cg0 so that we spread things around a little.

Also, within each stream, each batch of writes must be processed in tag sequence order, so having more streams (up to 16 anyway) reduces any potential for one write I/O to get caught in sequence behind a slower one. Also, each stream is sequence-tag-processed by one node. Ideally you would have consistency groups in perfect multiples of the number of SVC/V7000 nodes/canisters, so as to spread the processing evenly across all nodes.

OK, now let’s look at a few scenarios:

GMmaxhostdelay at 5 ms (default)
GMlinktolerance at 300 seconds (default)

  • If more than a third of the I/Os are slow and that happens repeatedly for 5 minutes, then the internal system controls will terminate the busiest relationship in that stream.
  • The default settings are looking for general slowness in host response caused by the use of GM.
  • Maybe you’d be willing to change GMlinktolerance to 600 seconds (10 minutes) and tolerate more impact at peak periods?

GMmaxhostdelay at 100 ms
GMlinktolerance at 30 seconds

  • If more than a third of the I/Os are extremely slow and that happens repeatedly for 30 seconds, then the internal system controls will terminate the busiest relationship in the stream.
  • These settings are looking for short periods of extreme slowness.
  • This has been suggested as something to use (after doing your own careful testing) in a VMware environment given that VMware does not tolerate long-outstanding I/Os. (Perhaps a little more moderate would be something like 10 ms and 60 seconds rather than 100 ms and 30 seconds.)

GMlinktolerance at 0 seconds

  • Set gmlinktolerance to 0 and the link will ‘never’ go down even if host I/O is badly affected. This was the default behaviour back in the very early days of SVC/V7000 replication.

At a slightly more detailed level, an approximation of how the gmlinktolerance and gmmaxhostdelay are used together is as follows:

  1. Look every 10 seconds and see if more than a third of the I/Os in any one stream were delayed by more than gmmaxhostdelay
  2. If more than a third were slow then we increase a counter by one for that stream, and if not we decrease the counter by one.
  3. If the counter gets to gmlinktolerance/10 then terminate the busiest relationship in the stream (and issue event code 1920)
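The three steps above can be sketched as a small Python model. To be clear, this is my own approximation for illustration and not the actual SVC/V7000 firmware logic: the class and function names are hypothetical, and I have assumed the counter never drops below zero, which the description above does not state.

```python
from dataclasses import dataclass
from typing import Sequence

SAMPLE_INTERVAL_SECONDS = 10   # "look every 10 seconds"
SLOW_FRACTION_LIMIT = 1 / 3    # "more than a third of the I/Os"


def slow_fraction(extra_delays_ms: Sequence[float], gm_max_host_delay_ms: float) -> float:
    """Fraction of I/Os in one sample whose added delay exceeded gmmaxhostdelay."""
    if not extra_delays_ms:
        return 0.0
    slow = sum(1 for delay in extra_delays_ms if delay > gm_max_host_delay_ms)
    return slow / len(extra_delays_ms)


@dataclass
class StreamMonitor:
    """Hypothetical per-stream counter model (illustration only)."""
    gm_link_tolerance_s: int = 300
    gm_max_host_delay_ms: float = 5.0
    counter: int = 0

    def sample(self, extra_delays_ms: Sequence[float]) -> bool:
        """Feed one 10-second sample; return True if a 1920 would be raised."""
        if self.gm_link_tolerance_s == 0:
            return False  # a tolerance of 0 means never terminate
        if slow_fraction(extra_delays_ms, self.gm_max_host_delay_ms) > SLOW_FRACTION_LIMIT:
            self.counter += 1
        else:
            # Assumption: the counter is floored at zero.
            self.counter = max(0, self.counter - 1)
        return self.counter >= self.gm_link_tolerance_s / SAMPLE_INTERVAL_SECONDS


monitor = StreamMonitor()              # defaults: 5 ms and 300 seconds
slow_sample = [8.0, 9.0, 1.0, 1.0]     # half of the I/Os exceed 5 ms
samples_until_1920 = next(n for n in range(1, 1000) if monitor.sample(slow_sample))
print(f"1920 raised after {samples_until_1920} consecutive slow samples")
```

With the defaults, the model needs 30 consecutive slow samples (5 minutes) before it terminates a relationship, and any good samples in between push that point further out.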

Hopefully this goes some way to explaining that event code 1920 is an intelligent, parameter-driven means of minimizing host performance impact; it’s not a defect in GM. The parameters give you a lot of freedom to choose how you want to run things, and you don’t have to stay with the defaults.
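Going back to the stream layout described earlier, the following Python sketch shows how relationships would land on the 16 streams. I am assuming that "wrapping around" means a simple modulo 16 on the consistency group number; the function names and the example volume names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional

STREAMS_PER_PARTNERSHIP = 16


def stream_for(consistency_group_id: Optional[int]) -> int:
    """Stream number for a relationship; None means a standalone volume relationship."""
    if consistency_group_id is None:
        return 0  # all standalone relationships share stream 0
    # Assumption: wrap-around is a plain modulo on the consistency group number.
    return consistency_group_id % STREAMS_PER_PARTNERSHIP


def layout_by_stream(relationships: Dict[str, Optional[int]]) -> Dict[int, List[str]]:
    """Group relationship names by the stream they would be processed in."""
    layout: Dict[int, List[str]] = defaultdict(list)
    for name, cg_id in relationships.items():
        layout[stream_for(cg_id)].append(name)
    return dict(layout)


example = {
    "vol_a": None,     # standalone
    "vol_b": None,     # standalone
    "db_log": 0,       # consistency group 0 shares stream 0 with the standalones
    "db_data": 1,
    "archive": 17,     # wraps around onto stream 1
}
for stream, names in sorted(layout_by_stream(example).items()):
    print(f"stream {stream}: {', '.join(names)}")
```

In this made-up layout, moving db_log out of cg0 into an otherwise unused consistency group would leave stream 0 to the standalone volumes, which is the point of creating an empty cg0.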

Solving another kind of Global Mirror problem back in 1920.
