1920 – a famous event [code]

IBM SAN Volume Controller and Storwize V7000 Global Mirror
_____________________________________________________________

1920 was a big year with many famous events. Space does not permit me to mention them all, so please forgive me if your significant event of 1920 is left off the list:

  • In the US the 18th Amendment takes effect and starts Prohibition
  • In the US the passing of the 19th Amendment gives women the vote [27 years after women in New Zealand had the same right].
  • The Covenant of the League of Nations (and the ILO) come into force, but the US decides not to join (in part because it grants the League the right to declare war)
  • The US Senate refuses to ratify the Treaty of Versailles (in part because it was considered too harsh on Germany)
  • Bloody Sunday – British troops open fire on spectators and players during a football match in Dublin killing 14 Irish civilians and wounding 65.
  • Anti-capitalists bomb Wall Street, killing 38 and seriously injuring 143
  • Numerous other wars and revolutions

There is another famous 1920 event, however: event code 1920 on IBM SAN Volume Controller and Storwize V7000 Global Mirror, and this event is much less well understood. A 1920 event code tells you that Global Mirror has just deliberately terminated one of the volume relationships you are replicating, in order to maintain good host application performance. It is not an error code as such; it is the result of automated, intelligent monitoring and decision making by your Global Mirror system.

I’ve been asked a couple of times why Global Mirror doesn’t automatically restart a relationship that has just terminated with a 1920 event code. Think about it. The system has just taken a considered decision to terminate the relationship, so why would it then restart it? If you don’t care about host impact then you can set GM up so that it doesn’t terminate in the first place, but don’t set it up to terminate on host impact and then blindly restart it as soon as it does what you told it to do.

1920 is a form of congestion control. Congestion can be at any point in the end-to-end solution:

  • Network bandwidth, latency, QoS
  • SVC/V7000 memory contention
  • SVC/V7000 processor contention
  • SVC/V7000 disk overloading

Before I explain how the system makes the decision to terminate, first let me summarize your options for avoiding 1920. That’s kind of back to front, but everyone wants to know how to avoid 1920 and not so many people really want to know the details of congestion control. Possible methods for avoiding 1920 are:

  1. Ask your IBM storage specialist or IBM Business Partner about using Global Mirror with Change Volumes (RPO of minutes) rather than traditional Global Mirror (RPO of milliseconds). You’ll need to be at version 6.3 or later of the firmware to run this. Note that VMware SRM support should be in place for GM/CV by the end of September 2012. Note also that the size of a 15 minute cycling change volume is typically going to be less than 1% of the source volumes, so you don’t need a lot of extra space for this.
  2. Ensure that you have optimized your streams – create more consistency groups, and create an empty cg0 if you are using standalone volumes. 
  3. Increase the GMmaxhostdelay parameter from its default of 5 milliseconds. The system monitors the extra host I/O latency due to the tag-and-release processing of each batch of writes, and if this goes above GMmaxhostdelay then the system considers that an undesirable situation.
  4. Increase the GMlinktolerance parameter from its default of 300 seconds. This is the window over which GM tolerates latency exceeding GMmaxhostdelay before deciding to terminate. Note, however, that it has been suggested you should not increase this in a VMware environment.
  5. Increase your network bandwidth, improve your network quality and QoS settings, or reduce your network latency. Don’t skimp on your network. Buy the licence for performance monitoring on your FCIP router (e.g. 2498-R06 feature code 7734 “R06 Performance Monitor”). I’m told that using that or using TPC are the two best ways to see what is happening with traffic from an FC perspective. I’m told that looking at traffic/load from an IP traffic monitor is not always going to give you the real story about the replication traffic.
  6. If your SVC/V7000 is constrained then add another I/O group to the system, or more disks at both ends if it is disk constrained. In particular don’t try to run Global Mirror from a busy production SAS/SSD system to a DR system with NL-SAS. You might be able to do that with GM/CV but not with traditional GM.
  7. Make sure there are no outstanding faults showing in the event log.

So now let’s move on to actually understanding the approach that SVC/V7000 takes to congestion control. First we need to understand streams. A GM partnership has 16 streams. All standalone volume relationships go into stream 0, consistency group 0 also goes into stream 0, consistency group 1 goes into stream 1, consistency group 2 goes into stream 2, and so on, wrapping around once you run out of streams (so consistency group 16 is back in stream 0).

Immediately we realize that if we are replicating a lot of standalone volumes it might make sense to create an empty cg0 so that we spread things around a little. Also, within each stream, each batch of writes must be processed in tag sequence order, so having more streams (up to 16 anyway) reduces the potential for one write I/O to get caught in sequence behind a slower one. Finally, each stream is sequence-tag-processed by one node. Ideally you would have a number of consistency groups that is an exact multiple of the number of SVC/V7000 nodes/canisters, so as to spread the processing evenly across all nodes.

OK, now let’s look at a few scenarios:

GMmaxhostdelay at 5 ms (default)
GMlinktolerance at 300 seconds (default)
  • If more than a third of the I/Os are slow and that happens repeatedly for 5 minutes, then the internal system controls will terminate the busiest relationship in that stream.
  • The default settings are looking for general slowness in host response caused by the use of GM
  • Maybe you’d be willing to change GMlinktolerance to 600 seconds (10 minutes) and tolerate more impact at peak periods?
GMmaxhostdelay at 100 ms
GMlinktolerance at 30 seconds
  •  If more than a third of the I/Os are extremely slow and that happens repeatedly for 30 seconds, then the internal system controls will terminate the busiest relationship in the stream
  • Looking for short periods of extreme slowness
  • This has been suggested as something to use (after doing your own careful testing) in a VMware environment, given that VMware does not tolerate long-outstanding I/Os. (Perhaps something a little more moderate would be GMmaxhostdelay at 10 ms and GMlinktolerance at 60 seconds, rather than 100 ms and 30 seconds.)

GMlinktolerance at 0 seconds

  • Set gmlinktolerance to 0 and the link will ‘never’ go down even if host I/O is badly affected. This was the default behaviour back in the very early days of SVC/V7000 replication.

At a slightly more detailed level, an approximation of how gmlinktolerance and gmmaxhostdelay are used together is as follows:

  1. Look every 10 seconds and see if more than a third of the I/Os in any one stream were delayed by more than gmmaxhostdelay
  2. If more than a third were slow then we increase a counter by one for that stream, and if not we decrease the counter by one.
  3. If the counter gets to gmlinktolerance/10 then terminate the busiest relationship in the stream (and issue event code 1920)
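
The three steps above can be sketched in Python as follows. This is only an illustration of the approximation described here, not the actual Global Mirror code: the function name and the input format are made up, and flooring the counter at zero is my assumption (the description only says the counter is decreased).

```python
SAMPLE_INTERVAL_S = 10  # the check described above runs every 10 seconds


def intervals_until_1920(slow_fractions, gmlinktolerance=300):
    """Return the 1-based sample at which a stream would hit 1920, or None.

    slow_fractions: for each 10-second sample, the fraction of the stream's
    I/Os delayed by more than gmmaxhostdelay (hypothetical input format).
    """
    if gmlinktolerance == 0:
        return None  # a tolerance of 0 means the relationship is never terminated
    threshold = gmlinktolerance // SAMPLE_INTERVAL_S
    counter = 0
    for sample, fraction in enumerate(slow_fractions, start=1):
        if fraction > 1 / 3:
            counter += 1
        else:
            counter = max(0, counter - 1)  # assumption: never drops below zero
        if counter >= threshold:
            return sample
    return None


# Half of all I/Os slow in every sample
print(intervals_until_1920([0.5] * 40))                      # 30 samples = 5 minutes
print(intervals_until_1920([0.5] * 40, gmlinktolerance=30))  # 3 samples = 30 seconds
```

Because good samples count the counter back down, a stream that alternates between slow and healthy samples never reaches the threshold in this sketch, which matches the idea that the condition has to happen "repeatedly".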

Hopefully this goes some way to explaining that event code 1920 is an intelligent, parameter-driven means of minimizing host performance impact; it’s not a defect in GM. The parameters give you a lot of freedom to choose how you want to run things, and you don’t have to stay with the defaults.

Solving another kind of Global Mirror problem back in 1920.


19 Responses

  1. Great post. I hate 1920 and we are going to implement a script to auto-restart GM, better than the old one cited in the Redbook and coming from my past.

    • Do you really think automating a restart is the right approach? If you don’t intend to do anything about the congestion that’s being signalled, wouldn’t you be better just to increase GMlinktolerance so the relationship doesn’t terminate quite so readily in the first place?

      • The nice thing about having the 1920 and restarting it automatically is that it doesn’t slow down the host during the re-sync. We’re actually running into problems where the GMLinkTolerance is causing unacceptable response time issues on our hosts.

        It would be nice if we could figure out a way to make GM run faster, but it doesn’t even fill up our replication pipe.

        • It would be interesting to know where your bottleneck is. I saw one recently where SATA drives at the DR site were the cause of the 1920s. You need to look at all the elements carefully and work out where it is.
          Or if you want to make your life really simple, switch to global mirror with change volumes (cyclingmode).

          • We’d like to know, too. We’re using 15K FC drives at both ends. Same basic config at each end. 8 node SVC CF8 Cluster, DS8700. Cisco FCIP.

            My best guess at this point is the 30ms latency between sites. SVC doesn’t seem to stream the data down very efficiently.

  2. Piece of your blog:
    If more than a third of the I/Os are slow and that happens repeatedly for 5 minutes, then terminate the busiest relationship in that stream.

    Q: How do you check this? Do you use TPC or another tool/script?

  3. The monitoring and terminating of this is done by the internal system processes of Global Mirror, so just to be clear, monitoring the latency added by GM tag-and-release processing is not a user action, and neither is terminating the busiest relationship.

    I’m not sure what level of insight TPC will give you into the latency added by GM tag-and-release processing. I expect TPC will be able to pick up 1920 events. You can also just look in the event log. This is what Angelo was referring to: you can write a script to check the log for 1920 and then issue a start on the relationship to get it going again.

    If you’re doing that a lot however, my thought is that it might be better to build in more tolerance up-front instead, by setting gmlinktolerance to 600 rather than the 300 default, or setting gmmaxhostdelay to 10 ms instead of the default 5 ms.

  4. Can you elaborate on this statement:

    “Immediately we realize that if we are replicating a lot of standalone volumes that it might make sense to create an empty cg0 so that we spread things around a little.”

    I do not need consistency groups, all of my volumes are stand-alone. I probably won’t need more than 16 mirrored volumes. Would it be beneficial to create a consistency group with one volume in it for each of the mirrored volumes? Right now I assume they are all using cg0 as they are stand-alone.

    • I need to check with Hursley on this question to be sure…

      ———————————————————————-
      Answer rewritten now that the dev guys have explained it to me…
      ———————————————————————-

      OK, the story is that:

      1) Within each stream, all writes are processed in tag sequence order, so any holdups in processing a write can slow down others behind it in the stream. Having more streams (up to 16 anyway) reduces this kind of potential congestion.

      2) Each stream is sequence-tag-processed by one node. You could ideally have n * N consistency groups, where N is the number of SVC/V7000 nodes/canisters, and n is any positive integer, so as to spread the processing evenly across all nodes.

      Thanks for the great question – we’re all learning together!

      Thanks, Jim

  5. Hi Jim;

    Right now I am trying to replicate around 1000 volumes and have placed them into 26 CGs. Does that mean cg0 and cg16 use the same stream? If so, the distribution of CGs to streams becomes very important, am I right?

    Thanks, very good article.

    • Yes, cg0 and cg16 are in the same stream. It’s worth being aware of the load balance so as to avoid having all the busiest volumes in one stream, but it doesn’t need to be a perfect balance.
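
The wrap-around can be sketched in a few lines of Python. The modulo-16 mapping follows the description in the post; the function name is hypothetical, and the 26 consistency groups are taken from the question.

```python
from collections import defaultdict

NUM_STREAMS = 16  # streams per GM partnership, per the post


def stream_for(cg_id=None):
    """Stream index for a consistency group ID; None means a standalone volume."""
    if cg_id is None:
        return 0  # standalone relationships all share stream 0
    return cg_id % NUM_STREAMS


# 26 consistency groups, cg0..cg25, as in the question
by_stream = defaultdict(list)
for cg in range(26):
    by_stream[stream_for(cg)].append(cg)

print(by_stream[0])   # [0, 16]
print(by_stream[9])   # [9, 25]
print(by_stream[10])  # [10]
```

With 26 groups, streams 0 to 9 each carry two consistency groups and streams 10 to 15 carry one, so it is worth keeping the busiest groups out of the doubled-up streams.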

  6. Great information… Your statement regarding extra space for GM, “Note also that the size of a 15 minute cycling change volume is typically going to be less than 1% of the source volumes, so you don’t need a lot of extra space for this.”, clears up for me the question of space overhead at the primary site. Is the same true for the secondary site? How would one actually calculate the extra space required at both sites using either traditional GM or GM with change volumes?

    Thanks!

    • The volumes created must have the same nominal size as the original volume, but they will be thin provisioned, so the nominal size is not so important. The amount of data they actually hold is just the change rate of the volume multiplied by the time between snaps. I guess you need to base it on peak change times: a database log, say, might change a lot, while a file volume might change only a very small amount. If you get 10% over 10 hours for a database, then maybe it peaks at 3% in the busiest hour, and so on. The best approach is to try this with a subset of volumes and see what happens (and reporting back would be good).

      “In theory, practice and theory are the same, but in practice they are different.” : )
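
As a rough starting point before testing, the arithmetic can be sketched in Python. The function name and the 1000 GiB volume are made up for illustration, and spreading the changes evenly across the hour is my simplifying assumption; only the 3% peak hour and the 15-minute cycle come from the discussion above.

```python
def change_volume_estimate_gib(volume_gib, peak_hourly_change, cycle_minutes):
    """Rough space consumed by a thin-provisioned change volume in one cycle.

    peak_hourly_change: fraction of the volume rewritten in the busiest hour.
    Assumes changes are spread evenly across that hour and ignores both
    repeated writes to the same blocks and copy granularity, so treat the
    result as a first guess only.
    """
    return volume_gib * peak_hourly_change * (cycle_minutes / 60)


# The database example: 3% change in the busiest hour, 15-minute cycle
estimate = change_volume_estimate_gib(1000, 0.03, 15)
print(f"{estimate:.1f} GiB")  # 7.5 GiB
```

That works out to 0.75% of the source volume, which is in line with the "less than 1%" figure quoted from the post.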

  7. VMware with SRM doesn’t like to have things in CGs. So by default, everything VM winds up in CG0. Any good way to deal with this in a large VM environment?

    • A good question.
      I guess at least you could take your 15 heaviest volumes and put them one each in cg1 to cg15, and the rest in cg0.
      Maybe SRM is best used with GM/CV to get around this, if we have that certified yet?
      I would maybe need to ask a few other people if they have any other suggestions.
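
The first suggestion could be sketched like this in Python, assuming you already have a per-volume write rate from your monitoring tool. The function, the datastore names and the numbers are all hypothetical.

```python
def assign_consistency_groups(write_rates, num_streams=16):
    """Map volume name -> consistency group ID.

    write_rates: dict of volume name -> measured write rate (any unit).
    The heaviest 15 volumes get cg1..cg15, one each; the rest share cg0.
    """
    ranked = sorted(write_rates, key=write_rates.get, reverse=True)
    assignment = {}
    for rank, volume in enumerate(ranked):
        assignment[volume] = rank + 1 if rank < num_streams - 1 else 0
    return assignment


# 20 made-up datastores with steadily decreasing write rates
rates = {f"vm_ds{i:02d}": 100 - i for i in range(20)}
groups = assign_consistency_groups(rates)
print(groups["vm_ds00"], groups["vm_ds14"], groups["vm_ds15"])  # 1 15 0
```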

  8. What do you think about using the -rate option to limit the MBps on some of your 1920 offenders?? It would be nice if the -rate option allowed you to specify “writes only”.
