FCIP Routers – A Best Practice Design Tip

Many years ago a Glaswegian friend of mine quoted someone as saying that the 1981 anti-apartheid protests in New Zealand (South African rugby tour) showed that New Zealand was not just a floating Surrey as some had previously suspected. While the Surrey reference might be lost on those not from England, I can tell you there are some distinct cultural and language differences between NZ and England.

For example, there was a (not very good) punk band called ‘Rooter’ back in the late 1970s in New Zealand. They ended up having to change their name to The Terrorways because ‘Rooter’ was considered too offensive by the managers of many pubs and clubs.

I guess that’s why in NZ we always pronounce ‘router’ to rhyme with ‘shouter’ even though we pronounce ‘route’ to rhyme with ‘shoot’. We’re kind of stuck in the middle between British and American English.

Pronunciation issues aside, however, FCIP routers are a highly reliable way to connect fabrics and allow replication over the WAN between Fibre Channel disk systems. The price of FCIP routers seems to have halved over the last year or so, which is handy, and live replicated DR sites have become much more commonplace in the midrange space in the last couple of years.

Apart from the WAN itself (which is the source of most replication problems) there are a couple of other things that it’s good to be aware of when assembling a design and bill of materials for FCIP routers.

  1. When you’re using the IBM SAN06B-R (Brocade 7800), we always recommend including the licence for ‘Integrated Routing’ (IR) if you’re going out over the WAN. This prevents the fabrics at either end of an FCIP link from merging. If a WAN link bounces occasionally, as many do, you want to protect your fabrics from repeatedly having to work out who’s in charge and stalling traffic on the SAN while they do that. Without IR your WAN FCIP environment might not really even be supportable.
  2. Similarly, I usually recommend the ‘Advanced Performance Monitoring’ (APM) feature. If you run into replication performance problems, APM will tell you what the FC app is actually seeing rather than you having to make assumptions based on IP network tools.
  3. The third point is new to me and was the real trigger for this blog post (thanks to Alexis Giral for his expertise in this area): if you have only one router per site (as most do), then best practice is to connect only one fabric at each site to it, as per the diagram below.

The reason for this is that the routers and the switches all run the same Fabric OS and there is a small potential for an error to be propagated across fabrics, even though Integrated Routing supposedly isolates the fabrics. This is something that Alexis tells me he has explored in detail with Brocade and they too recommend this as a point of best practice. If you already have single routers connected to both fabrics then I’m not sure the risk is high enough to warrant a reconfiguration, but if you’re starting from scratch you should not connect both fabrics to the one router. This would also apply if you are using Cisco MDS 9222i and MDS 91xx for example, as all switches and routers would be running NX-OS and the same potential for error propagation exists.
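The single-router rule is easy to check mechanically when reviewing a design. The following is a minimal sketch only: the router and fabric names are made up for illustration, and the connection list is assumed to be typed in by hand from the design document rather than read from any switch.

```python
from collections import defaultdict


def routers_spanning_fabrics(connections):
    """Return {router: [fabrics]} for every router attached to more than one fabric.

    `connections` is an iterable of (router_name, fabric_name) pairs.
    All names are hypothetical labels taken from a design document.
    """
    fabrics_by_router = defaultdict(set)
    for router, fabric in connections:
        fabrics_by_router[router].add(fabric)
    return {
        router: sorted(fabrics)
        for router, fabrics in fabrics_by_router.items()
        if len(fabrics) > 1
    }


# Hypothetical design: the site A router is cabled to both local fabrics.
design = [
    ("siteA-router1", "siteA-fabricA"),
    ("siteA-router1", "siteA-fabricB"),
    ("siteB-router1", "siteB-fabricA"),
]

for router, fabrics in routers_spanning_fabrics(design).items():
    print(f"{router} is connected to more than one fabric: {', '.join(fabrics)}")
```

An empty result means no single router bridges two fabrics, which is the layout recommended above for a one-router-per-site design.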


2 Responses

  1. I wanted to make a few comments about Extension infrastructures. I have been consulting, designing, building and troubleshooting FCIP architectures for over a decade now for the largest of companies, working with Nishan & CNT -> McDATA -> Brocade, Brocade 7500/FR, and now Brocade 7800/FX. Oh yeah, and Cisco 9222i 18/4 & SSN-16 (however the Cisco products are very low-performance products, so not so much). The Brocade 7800/FX are the “Who’s your Daddy!” of Extension devices.

    First, IBM is a very competent organization for Extension solutions. Kudos!

    Second, I am going to use the terminology “Channel Extender” for the 7800/FX because they were designed to be high performance and high efficiency extenders, unlike the 7500/FR, which are FC Routers from before the Condor2 ASIC. Although the 7800/FX have FCR capability, as do all modern Brocade switches/directors, this is now just a function of the Condor2 ASIC and not specific to any platform. The 7800/FX are channel extenders for FC and FICON and very popular in mainframe environments. There is a LOT of technology in these boxes taken from both CNT and Nishan Systems.

    Third, what does IR do? It enables FCR (FC Routing) which provides a demarcation point (EX_Port) for fabric services. Fabric services are terminated at this demarcation point and do not extend across the WAN, therefore, the edge fabric services are not susceptible to WAN faults. Devices communicate through their local edge fabric to a local proxy device. FCR transports it across a Backbone Fabric, which may or may not be FCIP. This fabric should be designed without end devices attached so that if it suffers from WAN faults there will be no disruption. This is best practice.

    Now on with it…
    I’m not so sure that an IR licence is the best recommendation whenever going over the WAN. This falls firmly into the category of “It depends”. Here are the situations in which an IR license is really needed:

    Extending Production Fabrics
    When connecting production fabrics on each side, which is not a foregone conclusion. If you have a DS8800 or XIV, for example, it is best practice to connect the channel extenders directly to the array and not go through the production fabric.

    Tape
    When you need ubiquitous connectivity for tape, it is best to go through a local fabric and the IR license is of value.

    IBM SVC
    When using IBM SVC, it is required that there be ubiquitous connectivity, therefore, the channel extenders will connect via the production fabric and the production fabric should be isolated using FCR.

    No Dedicated RDR Array Ports
    When you have an array that does not have dedicated RDR (Remote Data Replication) ports, you need to connect that port to a fabric so that the port sees not only the hosts but also the channel extender. IR would be useful here. If a port can be dedicated to RDR then that is best practice.

    If there is no need to connect to the production fabrics, then IR is not really required and incurs additional expense. BTW, RDR and tape are not typically considered production, albeit, they may be connected via a production fabric.
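The four situations above can be turned into a small checklist. This is a sketch of the reasoning only: the class, field, and function names are hypothetical, and the rules encode nothing beyond the cases listed in this comment.

```python
from dataclasses import dataclass


@dataclass
class ExtensionDesign:
    """Hypothetical description of a site's extension design."""
    extends_production_fabrics: bool
    tape_needs_fabric_connectivity: bool
    uses_svc: bool
    array_has_dedicated_rdr_ports: bool


def ir_licence_reasons(design):
    """Return the listed situations that apply; an empty list means IR is not really required."""
    reasons = []
    if design.extends_production_fabrics:
        reasons.append("extending production fabrics")
    if design.tape_needs_fabric_connectivity:
        reasons.append("tape needs connectivity through a local fabric")
    if design.uses_svc:
        reasons.append("IBM SVC requires connectivity via the production fabric")
    if not design.array_has_dedicated_rdr_ports:
        reasons.append("array has no dedicated RDR ports")
    return reasons


# Array with dedicated RDR ports cabled directly to the channel extenders.
direct_attach = ExtensionDesign(False, False, False, True)
# SVC environment, which goes through the production fabric.
svc_site = ExtensionDesign(True, False, True, True)

print(ir_licence_reasons(direct_attach))
print(ir_licence_reasons(svc_site))
```

The first design returns no reasons, which matches the point that IR only adds expense when nothing connects to the production fabrics.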

    The license that is really needed almost every time is the AEX (Advanced Extension License). Almost never is this not needed. AEX gets you FCIP Trunking and ARL (Adaptive Rate Limiting). If one Ethernet interface from a 2 box Brocade 7800/FX solution is going into a single WAN link that is not shared (dedicated BW), then it is possible to not use the AEX license.

    Let’s look at the example diagram above. ARL is needed when more than one connection from the 7800/FX goes into the IP network and those connections share one or more WAN links with limited available bandwidth (BW). ARL arbitrates the use of that limited BW, and it also permits the use of that BW when a connection from the 7800/FX is offline (bad optic, broken cable, maintenance, configuration change…). In a 4 box solution, it is even more important because there are usually 4 connections for box redundancy plus link redundancy. With static rate limiting, the rate limiter would not be able to adapt to failure conditions and there would be no resiliency of the FCIP infrastructure.
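The difference between static and adaptive rate limiting under a failure can be shown with some simple arithmetic. The sketch below is a toy model and not Brocade's ARL algorithm: the function names, the equal-share policy, the 600 Mbps WAN link, and the per-circuit ceiling are all assumptions chosen for illustration.

```python
def static_rates(link_mbps, circuits_online):
    """Each circuit is pinned to a fixed share of the link, whether or not its peers are up."""
    share = link_mbps / len(circuits_online)
    return [share if up else 0.0 for up in circuits_online]


def adaptive_rates(link_mbps, circuits_online, ceiling_mbps):
    """Online circuits split the link between them, up to a per-circuit ceiling."""
    online = sum(circuits_online)
    if online == 0:
        return [0.0] * len(circuits_online)
    share = min(link_mbps / online, ceiling_mbps)
    return [share if up else 0.0 for up in circuits_online]


# Hypothetical 4 box solution: four circuits share a 600 Mbps WAN link and one is offline.
state = [True, True, True, False]

static = static_rates(600, state)
adaptive = adaptive_rates(600, state, ceiling_mbps=600)

print("static  :", static, "total", sum(static))
print("adaptive:", adaptive, "total", sum(adaptive))
```

In this model the static limiter leaves 150 Mbps of the link idle while a circuit is down, whereas the adaptive limiter lets the three surviving circuits grow to 200 Mbps each and keep the link full.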

    FCIP Trunking is a mechanism that allows multiple FCIP circuits to form a single ISL from the perspective of the fabrics. A single VE_Port is used on each side of the FCIP Trunk. This gets you the following: Aggregate BW, Lossless Link Loss (LLL, the ability to not lose data that was in flight when a backhoe takes out a fiber path), Automated Fail-over and Fail-back, IOD (In-Order Delivery), and no FSPF routing changes. In the case above, the two red circuits from a 7800 would be FCIP Trunked into one FCIP ISL. Since each circuit can be assigned to a different GE interface, there can be an aggregate of 2 Gbps of FCIP BW. If one of those links were to be lost, the data lost in flight would be resent over the remaining link and placed back in order before being sent to the ULP (Upper Layer Protocols).
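The resend-and-reorder behaviour can be pictured as a simple resequencing buffer. This is a conceptual sketch only, not how the 7800/FX implements it: the integer sequence numbers, the function name, and the arrival pattern are assumptions made for illustration.

```python
def deliver_in_order(arrivals, first_seq=0):
    """Release frames to the upper layer strictly in sequence order.

    Frames that arrive ahead of a missing one are held until the gap is filled.
    """
    held = set()
    next_seq = first_seq
    delivered = []
    for seq in arrivals:
        held.add(seq)
        # Drain every frame that is now contiguous with what was already delivered.
        while next_seq in held:
            held.remove(next_seq)
            delivered.append(next_seq)
            next_seq += 1
    return delivered


# Frames 0..5 are striped across two circuits. The circuit carrying frames 1 and 3
# fails, so those frames are resent on the surviving circuit and arrive late.
arrivals = [0, 2, 4, 1, 3, 5]
print(deliver_in_order(arrivals))
```

Frames 2 and 4 sit in the buffer until the resent frames 1 and 3 arrive, so the upper layer never sees a gap or an out-of-order frame.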

    Lastly, 2 box vs. 4 box solutions. Obviously, best practice is always a 4 box solution; however, this is not always practical for enterprises of all sizes. Nonetheless, it is never considered good practice to connect a single 7800 to both the A & B production fabrics. This is not a matter of not joining fabrics together and preventing that by using FCR. That’s not it at all. It is a matter of a single Linux kernel that is running fabric services to both A & B up to that demarcation point. It is conceivable that a problem (Linux, HW, FOS, or human error) can cause both the A & B fabrics to fail at the same time. Yes, this means an entire SAN-wide outage and a very bad day for all those involved. This is not a problem if the FCIP fabric is separate from the production fabric, only if connecting to both production fabrics. Best practice for a 2 box solution is to not connect to the production fabric, which, BTW, is what I believe is shown above: a completely separate RDR/Tape network. Or, as a last resort, only extend the “A” fabric and not the “B” until you get budget to buy 2 more boxes for “B”.

    Best Regards,
    Mark Detrick
    BCFP, BCAF, BCNE, CISSP, CCIE
