1920 – a famous event [code]

IBM SAN Volume Controller and Storwize V7000 Global Mirror
_____________________________________________________________

1920 was a big year with many famous events. Space does not permit me to mention them all, so please forgive me if your significant event of 1920 is left off the list:

  • In the US the passing of the 18th Amendment starts Prohibition
  • In the US the passing of the 19th Amendment gives women the vote (27 years after women in New Zealand gained the same right)
  • The Covenant of the League of Nations (and the ILO) come into force, but the US decides not to sign (in part because it grants the league the right to declare war)
  • The US Senate refuses to sign the treaty of Versailles (in part because it was considered too harsh on Germany)
  • Bloody Sunday – British troops open fire on spectators and players during a football match in Dublin killing 14 Irish civilians and wounding 65.
  • Anti-capitalists bomb Wall Street, killing 38 and seriously injuring 143
  • Numerous other wars and revolutions

There is another famous 1920 event, however – event code 1920 on IBM SAN Volume Controller and Storwize V7000 Global Mirror – and this event is much less well understood. A 1920 event code tells you that Global Mirror has just deliberately terminated one of the volume relationships you are replicating, in order to maintain good host application performance. It is not an error code as such; it is the result of automated, intelligent monitoring and decision making by your Global Mirror system. I’ve been asked a couple of times why Global Mirror doesn’t automatically restart a relationship that has just terminated with a 1920 event code. Think about it: the system has just taken a considered decision to terminate the relationship, so why would it then restart it? If you don’t care about host impact you can set GM up so that it doesn’t terminate in the first place, but don’t set it up to terminate on host impact and then blindly restart it as soon as it does what you told it to do. 1920 is a form of congestion control. Congestion can arise at any point in the end-to-end solution:

  • Network bandwidth, latency, QoS
  • SVC/V7000 memory contention
  • SVC/V7000 processor contention
  • SVC/V7000 disk overloading

Before I explain how the system makes the decision to terminate, let me first summarize your options for avoiding 1920. That’s kind of back to front, but everyone wants to know how to avoid 1920, and not so many people really want to know the details of congestion control. Possible methods for avoiding 1920 are (this list has been updated a few times since first posting):

  1. Ask your IBM storage specialist or IBM Business Partner about using Global Mirror with Change Volumes (RPO of minutes) rather than traditional Global Mirror (RPO of milliseconds). You’ll need to be at version 6.3 or later of the firmware to run this. Note that VMware SRM support should be in place for GM/CV by the end of September 2012. Note also that the size of a 15 minute cycling change volume is typically going to be less than 1% of the source volumes, so you don’t need a lot of extra space for this.
  2. Ensure that you have optimized your streams – create more consistency groups, and create an empty cg0 if you are using standalone volumes. 
  3. Increase the GMmaxhostdelay parameter from its default of 5 milliseconds. The system monitors the extra host I/O latency due to the tag-and-release processing of each batch of writes, and if this goes above GMmaxhostdelay then the system considers that an undesirable situation.
  4. Increase the GMlinktolerance parameter from its default of 300 seconds. This is the window over which GM tolerates latency exceeding GMmaxhostdelay before deciding to terminate. Note, however, that it has been suggested you should not increase this in a VMware environment.
  5. Increase your network bandwidth, improve your network quality and QoS settings, or reduce your network latency. Don’t skimp on your network. Buy the licence for performance monitoring on your FCIP router (e.g. 2498-R06 feature code 7734, “R06 Performance Monitor”). I’m told that using that, or using TPC, are the two best ways to see what is happening with traffic from an FC perspective. I’m told that looking at traffic/load from an IP traffic monitor is not always going to give you the real story about the replication traffic.
  6. If your SVC/V7000 is constrained then add another I/O group to the system, or more disks at both ends if it is disk constrained. In particular don’t try to run Global Mirror from a busy production SAS/SSD system to a DR system with NL-SAS. You might be able to do that with GM/CV but not with traditional GM.
  7. Make sure there are no outstanding faults showing in the event log.
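On point 1, the change-volume space estimate is simply change rate multiplied by cycle period. Here is a back-of-envelope sketch of that arithmetic (my own illustration; the example volume size and change percentage are hypothetical, not measurements):

```python
# Back-of-envelope estimate of the real space a GM/CV change volume consumes:
# space per cycle ~= volume size x peak change rate x cycle period.
# The example numbers below are hypothetical.

def change_volume_gb(volume_gb, peak_change_pct_per_hour, cycle_minutes=15):
    """Estimated thin-provisioned space consumed in one cycle, in GB."""
    change_fraction_per_minute = (peak_change_pct_per_hour / 100) / 60
    return volume_gb * change_fraction_per_minute * cycle_minutes

# A 2000 GB database volume peaking at 3% change per hour needs
# roughly 15 GB of real capacity per 15-minute cycle:
assert abs(change_volume_gb(2000, 3) - 15) < 1e-6
```

As you can see, even a fairly busy volume consumes only a small fraction of its nominal size per cycle, which is why the “less than 1%” rule of thumb usually holds.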

So now let’s move on to actually understanding the approach that SVC/V7000 takes to congestion control. First we need to understand streams. A GM partnership has 16 streams. All standalone volume relationships go into stream 0, consistency group 0 also goes into stream 0, consistency group 1 goes into stream 1, consistency group 2 goes into stream 2, and so on, wrapping around as you get beyond 15. Immediately we realize that if we are replicating a lot of standalone volumes, it might make sense to create an empty cg0 so that we spread things around a little. Also, within each stream, each batch of writes must be processed in tag sequence order, so having more streams (up to 16 anyway) reduces the potential for one write I/O to get caught in sequence behind a slower one. Finally, each stream is sequence-tag-processed by one node, so ideally you would have consistency groups in perfect multiples of the number of SVC/V7000 nodes/canisters, to spread the processing evenly across all nodes.

OK, now let’s look at a few scenarios:
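The stream mapping can be sketched in a few lines of code (my own illustration of the rule just described, not anything from the actual SVC code):

```python
# Hypothetical sketch of how GM relationships map onto the 16 streams:
# standalone relationships and cg0 share stream 0; cg1 -> stream 1, ...,
# cg15 -> stream 15; cg numbers beyond 15 wrap around.
NUM_STREAMS = 16

def stream_for(cg_id=None):
    """Return the stream index for a relationship.
    cg_id is None for a standalone volume relationship."""
    if cg_id is None:
        return 0          # all standalone relationships land in stream 0
    return cg_id % NUM_STREAMS

assert stream_for() == 0      # standalone volume
assert stream_for(0) == 0     # cg0 shares stream 0 with the standalones
assert stream_for(16) == 0    # cg16 wraps around to stream 0
assert stream_for(5) == 5
```

This also makes the empty-cg0 trick obvious: moving volumes out of consistency group 0 is the only way to get them out of the crowded stream 0.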

GMmaxhostdelay at 5 ms (default)
GMlinktolerance at 300 seconds (default)
  • If more than a third of the I/Os are slow and that happens repeatedly for 5 minutes, then the internal system controls will terminate the busiest relationship in that stream.
  • The default settings are looking for general slowness in host response caused by the use of GM
  • Maybe you’d be willing to change GMlinktolerance to 600 seconds (10 minutes) and tolerate more impact at peak periods?
GMmaxhostdelay at 100 ms
GMlinktolerance at 30 seconds
  •  If more than a third of the I/Os are extremely slow and that happens repeatedly for 30 seconds, then the internal system controls will terminate the busiest relationship in the stream
  • Looking for short periods of extreme slowness
  • This has been suggested as something to use (after doing your own careful testing) in a VMware environment given that VMware does not tolerate long-outstanding I/Os.

GMlinktolerance at 0 seconds

  • Set gmlinktolerance to 0 and the link will ‘never’ go down even if host I/O is badly affected. This was the default behaviour back in the very early days of SVC/V7000 replication.

At a slightly more detailed level, an approximation of how gmlinktolerance and gmmaxhostdelay are used together is as follows:

  1. Look every 10 seconds and see if more than a third of the I/Os in any one stream were delayed by more than gmmaxhostdelay
  2. If more than a third were slow then we increase a counter by one for that stream, and if not we decrease the counter by one.
  3. If the counter gets to gmlinktolerance/10 then terminate the busiest relationship in the stream (and issue event code 1920)
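The three steps above can be sketched as a small loop (my own approximation for illustration, not the actual internal SVC/V7000 implementation):

```python
# Hypothetical sketch of the 1920 decision loop described above.
# One sample is taken per 10-second interval; each sample is the fraction
# of I/Os in a stream delayed by more than gmmaxhostdelay.

def congestion_check(slow_fraction_samples, gmlinktolerance=300):
    """Return the sample index at which a 1920 would be raised, or None."""
    threshold = gmlinktolerance // 10    # e.g. 300s / 10s = 30 bad samples
    counter = 0
    for i, frac in enumerate(slow_fraction_samples):
        if frac > 1 / 3:
            counter += 1                 # another bad 10-second interval
        else:
            counter = max(0, counter - 1)
        if counter >= threshold:
            return i                     # terminate busiest relationship: 1920
    return None

# 30 consecutive bad intervals (5 minutes) trip the default tolerance:
assert congestion_check([0.5] * 40) == 29
# Intermittent slowness never accumulates to the threshold:
assert congestion_check([0.5, 0.1] * 40) is None
```

Note how the counter decays when an interval is healthy, which is why only *sustained* slowness (or a very low gmlinktolerance) produces a 1920.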

Hopefully this goes some way to explaining that event code 1920 is an intelligent, parameter-driven means of minimizing host performance impact; it’s not a defect in GM. The parameters give you a lot of freedom to choose how you want to run things, so you don’t have to stay with the defaults.

Solving another kind of Global Mirror problem back in 1920.

36 Responses

  1. Great post. I hate 1920 and we are going to implement a script to auto-restart GM, better than the old one cited in the Redbook and coming from my past.

    • Do you really think automating a restart is the right approach? If you don’t intend to do anything about the congestion that’s being signalled, wouldn’t you be better just to increase GMlinktolerance so the relationship doesn’t terminate quite so readily in the first place?

      • The nice thing about having the 1920 and restarting it automatically is that it doesn’t slow down the host during the re-sync. We’re actually running into problems where the GMLinkTolerance is causing unacceptable response time issues on our hosts.

        It would be nice if we could figure out a way to make GM run faster, but it doesn’t even fill up our replication pipe.

        • It would be interesting to know where your bottleneck is. I saw one recently where sata drives at the DR site were the cause of the 1920s. You need to look at all the elements carefully and work out where it is.
          Or if you want to make your life really simple, switch to global mirror with change volumes (cyclingmode).

          • We’d like to know, too. We’re using 15K FC drives at both ends. Same basic config at each end. 8 node SVC CF8 Cluster, DS8700. Cisco FCIP.

            My best guess at this point is the 30ms latency between sites. SVC doesn’t seem to stream the data down very efficiently.

            • Check out the post on SANslide. It may be that it’s partly a generic latency-stealing-your-bandwidth problem?
              Also Storwize code 7.2 has some Global Mirror performance improvements and I expect the next release to have even more improvements.

              • Jim – Been seeing some chatter about increasing the rcbuffersize at the target from 48 up to a max of 512. Very little documentation about it, though. My 1920 errors have escalated to almost 150 a day now across 5 SVC clusters. 60 second timeout, 5 ms host delay.

                I love SVC, but Global Mirror is going to be the nail in its coffin if we can’t get better throughput.

                • By 60 second timeout do you mean gmlinktolerance of 60 seconds? That would mean the counter would only get to 6 before taking a 1920 action.

                  Can you move some of the relationships to cycle mode?

                  The standard advice on rcbuffersize is to talk to L2 support before changing it. Have you tried logging a support call to get input from L2?

                  • Yes, gmlinktolerance = 60. We had to change it down because it was having too much of an impact on our primary side response time.

                    We’re working on getting the VMWare stuff moved to GM/CV, but we still have to finish the testing, SRA updates, etc.

                    I’ll open a ticket with level 2 on the rcbuffersize. But it would be nice to see some documentation on it. If it only affects the target and only takes away potential cache memory at the target, I’m OK with increasing it. But is TPC tracking buffer utilization anywhere so we can tell if it’s getting overrun?

    • Is there any way to convert regular GM sessions to GM/CV via the TPC-R console? I know it’s possible via the SVC GUI, but we have many sessions / copy sets defined already in TPC-R.

  2. Piece of your blog:
    If more than a third of the I/Os are slow and that happens repeatedly for 5 minutes, then terminate the busiest relationship in that stream.

    Q: How do you check this? Do you use TPC or another tool/script?

  3. The monitoring and terminating of this is done by the internal system processes of Global Mirror, so just to be clear, monitoring the latency added by GM tag-and-release processing is not a user action, and neither is terminating the busiest relationship.

    I’m not sure what level of insight TPC will give you into the latency added by GM tag-and-release processing. I expect TPC will be able to pick up 1920 events. You can also just look in the event log. This is what Angelo was referring to – you can write a script to check the log for 1920 and then issue a start on the relationship to get it going again.

    If you’re doing that a lot however, my thought is that it might be better to build in more tolerance up-front instead, by setting gmlinktolerance to 600 rather than the 300 default, or setting gmmaxhostdelay to 10 ms instead of the default 5 ms.

  4. Can you elaborate on this statement:

    “Immediately we realize that if we are replicating a lot of standalone volumes that it might make sense to create an empty cg0 so that we spread things around a little.”

    I do not need consistency groups, all of my volumes are stand-alone. I probably won’t need more than 16 mirrored volumes. Would it be beneficial to create a consistency group with one volume in it for each of the mirrored volumes? Right now I assume they are all using cg0 as they are stand-alone.

    • I need to check with Hursley on this question to be sure…

      ———————————————————————-
      Answer rewritten now that the dev guys have explained it to me…
      ———————————————————————-

      OK, the story is that:

      1) Within each stream, all writes are processed in tag sequence order, so any holdups in processing a write can slow down others behind it in the stream. Having more streams (up to 16 anyway) reduces this kind of potential congestion.

      2) Each stream is sequence-tag-processed by one node. You could ideally have n * N consistency groups, where N is the number of SVC/V7000 nodes/canisters, and n is any positive integer, so as to spread the processing evenly across all nodes.

      Thanks for the great question – we’re all learning together!

      Thanks, Jim

  5. Hi Jim;

    Right now I am trying to replicate around 1000 volumes, which I have placed into 26 cg’s. Does that mean cg0 and cg16 use the same stream? If so, the distribution of cg’s to streams becomes quite important, am I right?

    Thanks, very good article.

    • Yes, cg0 and cg16 are in the same stream. It’s worth being aware of the load balance so as to avoid having all the busiest volumes in one stream, but it doesn’t need to be a perfect balance.

  6. Great information… Your statement regarding extra space for GM: “Note also that the size of a 15 minute cycling change volume is typically going to be less than 1% of the source volumes, so you don’t need a lot of extra space for this.” clears up for me the question of space overhead at the primary site. Is the same true for the secondary site? How would one actually calculate the extra space required at both sites using either traditional GM or GM with change volumes?

    Thanks!

    • The change volumes created must have the same nominal size as the original volumes, but they will be thin provisioned, so the nominal size is not so important. The real space consumed is just the change rate of the volume over the time between snaps. I guess you need to base it on peak change times – in the case of a database log, say, it might change a lot, but in the case of a file volume, only a very small amount. If you get 10% change over 10 hours for a database, then maybe it peaks at 3% in the busiest hour, etc. The best approach is to try this with a subset of volumes and see what happens (and a report back would be good).

      “In theory, practice and theory are the same, but in practice they are different.” : )

  7. VMWare with SRM doesn’t like to have things in CG’s. So by default, every VM winds up in cg0. Any good way to deal with this in a large VM environment?

    • A good question.
      I guess at least you could take your 15 heaviest volumes and put them one each in cg1 to cg15, and the rest in cg0.
      Maybe SRM is best used with GM/CV to get around this, if we have that certified yet.
      I would maybe need to ask a few other people if they have any other suggestions.

  8. What do you think about using the -rate option to limit the MBps on some of your 1920 offenders? It would be nice if the -rate option allowed you to specify “writes only”.

    • You could do that as a tactical fix I guess, but what about the app users? Global Mirror with Change Volumes might be a better approach. Maybe try GMCV with a cycle period of 5 minutes?

      With each release the developers are tuning Global Mirror to make it less susceptible to congestion and I expect that to continue over the next little while so that 1920s will become less and less of an issue over time.

  9. Hi Jim,

    Is there any restriction on the number of consistency groups you can create for GM? I would also like to know about the number of relationships which can be included in a group – is there any restriction on that?
    If we have more members/relationships in a CG, say about 50 or more, is this fine, or do we need to split them into new groups so that they can replicate smoothly without throwing 1920’s?

    • Spreading it out across all 16 consistency groups would be ideal.

      The SVC 6.4 restrictions are listed here:
      http://www-01.ibm.com/support/docview.wss?uid=ssg1S1004115

      e.g.

      Remote Copy (Metro Mirror and Global Mirror) relationships per
      system => 8192. This can be any mix of Metro Mirror and Global Mirror relationships. Maximum requires an 8-node cluster (volumes per I/O group limit applies)

      Remote Copy relationships per consistency group =>No limit is imposed beyond the Remote Copy relationships per system limit

      Remote Copy consistency groups per system =>256

      Total Metro Mirror and Global Mirror volume capacity per I/O group => 1024 TB. This limit is the total capacity for all master and auxiliary volumes in the I/O group. Note: Do not use Volumes larger than 2TB in Global Mirror with Change Volumes relationships.

      Total number of Global Mirror with Change Volumes relationships per system => 256

  10. Thanks Jim. Currently there are more than 40 groups, so basically splitting it into more CG groups will be good, right?
    One more thing is that the code we are running is quite old, so I have certain limitations as well.
    Basically, the number 16 comes from the number of streams available for replication?

    • If you have 40 cg’s already then I’m not sure that splitting it into more will help you, since the system manages congestion by stream rather than by cg. There is a slight potential for optimization by having a multiple of 16, but that also depends on what is in cg0.

      The system has to maintain write order consistency across the whole cg, so there is some overhead to managing a large cg, but I don’t know if it is significant. cg’s have two valid reasons for existing: one is for volumes that need to be lock-stepped, and the other is for manageability of an app environment.

      Your best bet is probably to start planning to get to 7.2 (and eventually 7.3). That is probably where you will see the biggest gains.

      But also take an all-encompassing approach – make sure your SVC nodes aren’t stressed, add more nodes, or upgrade to CG8 nodes if need be, check your disk stress levels, look at network latency, consider using GM/CV for some volumes etc.

  11. Hi Jim,

    my situation is that there is a host with 7 TB of data on about 60+ LUNs, all of which are replicating; there are also other groups replicating as well. Weekly, a full backup of this 7 TB of data is taken to a different LUN.
    The backup LUN is a single 7 TB SAN LUN. I am getting 1920 errors only when the full backup runs. Initially the backup and replication were running fine, then there was an issue at the host side and it crashed, and this error started popping up after the host reboot following the crash. Any idea what the culprit might be?

