Bow ties are cool – When time and space collide

Every storage vendor has sales slides that tell us that data growth rates are accelerating and the world will explode soon unless you buy their product to manage that…

…and yet the average IT shop is still mostly doing backups the old-fashioned way, with weekly fulls and daily incrementals, and scratching their heads about how they are going to cope next year, given that the current full is already taking 48 hours. They probably have a whole bunch of SATA disk somewhere that acts as the initial backup target, but it doesn’t go any faster than tape – something they assumed it would when they bought it – yet somehow they feel that their backups to disk are probably a good thing anyway, even though they’re more expensive…

…and they may even have a Data Domain dedup VTL plugged in somewhere, which seemed cool when they bought it – as long as you don’t think too hard about what it cost, how fast it goes (or doesn’t), and the fact that it turned out to be way too small and the upgrade to the bigger model isn’t financially justifiable…

…and the problem is either data – too much of it – or time – not enough of it – depending on your perspective. That’s why EMC called their point-in-time-copy software TimeFinder way back when… but then they also use the names SnapView on CX and SnapSure on Celerra.

IBM calls its point-in-time copy FlashCopy on most platforms, although on both XIV and SONAS it’s just called a snapshot. Netapp uses Snapshot™ and tries to assert that as a registered trademark (I recall a performance monitoring tool called snapshot on mainframes as far back as the early ’80s). HDS calls it ShadowImage on USP and “Copy-on-Write Snapshot” on AMS.

So, a plethora of names for much the same thing – a way to cheat time by creating an instant copy of data that would normally take hours to back up. There is one major difference between snapshot implementations: whether they are redirect-on-write (as in XIV and Netapp) or copy-on-write (as in pretty much everything else). COW has 50% higher disk-seek overheads, because the first write to each block has to copy the old data out before updating it, but it lets you direct your snaps to a different class of disk (e.g. snap a tier-1 volume onto tier-3 disk). ROW is inherently in-place (i.e. on the same tier as the source volume), which might be a limitation on Netapp, but isn’t an issue on XIV because it’s a single-tier design anyway.
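To make the difference concrete, here’s a minimal sketch of the two write paths. This is hypothetical and heavily simplified – block maps become Python dicts, and volumes, tiers and on-disk metadata are all flattened away – so treat it as an illustration of the idea, not any array’s actual implementation:

```python
# Minimal sketch of COW vs ROW snapshot write paths (hypothetical).

class CowSnapshot:
    """Copy-on-write: the snap area receives the OLD data; the source
    volume is updated in place, so the snap area can sit on a cheaper tier."""

    def __init__(self, source):
        self.source = source      # live volume: block number -> data
        self.snap_area = {}       # old versions of overwritten blocks

    def write(self, block, data):
        if block in self.source and block not in self.snap_area:
            # The extra seek/copy COW pays on the first overwrite of a block.
            self.snap_area[block] = self.source[block]
        self.source[block] = data

    def read_snapshot(self, block):
        # Unchanged blocks are still read from the live volume.
        return self.snap_area.get(block, self.source.get(block))


class RowSnapshot:
    """Redirect-on-write: old blocks stay where they are (they ARE the
    snapshot); new writes are redirected elsewhere in the same pool."""

    def __init__(self, source):
        self.frozen = source        # the snapshot is the original block map
        self.live = dict(source)    # the live view diverges from here on

    def write(self, block, data):
        # No pre-copy of old data, hence lower overhead - but snap and
        # source necessarily share the same pool/tier.
        self.live[block] = data

    def read_snapshot(self, block):
        return self.frozen.get(block)


# Usage: snap a volume, then overwrite a block.
vol = {0: b"alpha", 1: b"beta"}
cow = CowSnapshot(vol)
cow.write(0, b"ALPHA")
assert cow.read_snapshot(0) == b"alpha" and vol[0] == b"ALPHA"
```

Note how CowSnapshot.write does extra work on the first overwrite of each block while RowSnapshot.write does not – that’s the trade described above: COW pays at write time but can place its snap area on cheap disk; ROW is cheap at write time but pinned to the source’s tier.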

A recent Infosmack podcast on backups was a frustrating affair to listen to. I don’t mind a strong opinion if it’s backed by experience, but just because three guys on a podcast agree with each other doesn’t make what they say true (e.g. that COW snaps are no good for backups).

Anyway, the question is: why don’t more people use snaps as a way to create backups? One reason has been that vendors often charge for the privilege of using snaps (among the few that don’t are IBM’s XIV, Netapp, and IBM’s SONAS). I also suspect that up until now it hasn’t been too hard to do without snaps, but as data growth spirals (I know that’s happening because all those vendor PowerPoints tell me it is) we need to start using smarter options.

IBM’s Tivoli Storage Manager has always done incremental-forever backups, which is a lot smarter than weekly full, daily incremental, but some people find TSM’s ‘trust me – I know what I’m doing’ approach scary. That tells me that to be popular in the mid-market, backup technology not only needs to work, it needs to be emotionally reassuring.
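For readers who haven’t met the idea, here’s a rough sketch of incremental-forever selection. It’s hypothetical and much simpler than what TSM actually does (TSM keeps its catalog in a server database and tracks versions and retention policies), but it shows the principle: after the first run there is never another full, only changes.

```python
import os

# Hypothetical sketch of incremental-forever: a catalog remembers what
# was backed up and when; each run sends only new/changed files, and a
# full restore is assembled from the catalog, never from a weekly full.

catalog = {}  # path -> (size, mtime) at last backup; real products keep versions

def incremental_forever_run(root):
    changed = []
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            signature = (st.st_size, st.st_mtime)
            if catalog.get(path) != signature:   # new or modified file
                changed.append(path)
                catalog[path] = signature
    return changed  # only these files get sent to the backup server

# First run copies everything; every later run is proportional to the
# day's churn, not to the total data size:
# day1 = incremental_forever_run("/data")   # full-sized
# day2 = incremental_forever_run("/data")   # small
```

The ‘trust me’ part that unnerves people is exactly this: there is no periodic full to point at, just a catalog you have to believe in.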

Many combinations of snaps, replication and tape seem feasible at first glance, but you have to be careful that you are designing for restore, not just for backup. For example, replicating to a DR site and then keeping daily snaps there might be great for DR, but it might not make for a quick and easy restore at production.

This leads us towards products that provide app-aware snaps on your production systems, e.g. Netapp’s SnapManager family and IBM’s Tivoli Storage FlashCopy Manager, about which I have previously blogged.

I’m still searching for the perfect design, and I’m currently working on one for a real client situation; at this stage the only thing I know for sure is that it will involve backups to both disk and tape. It’s proving to be quite a struggle to accommodate the demands of both time and space, so, as Dr Who says, the least we can do is dress well – and bow ties are cool…

2 Responses

  1. I’m with the “snapshots are not backups” brigade, I’m afraid, but I have to admit it is more about semantics.

    Borrowing from my employer’s terminology, a snapshot is a “recovery point”. It allows you to restore data quickly and effectively, and protects against file deletion and corruption.

    Because you don’t take a copy of all blocks (only modified ones), there is a lot a snapshot doesn’t protect against. A failure of the RAID group, or of the metadata that tracks changed blocks (aggregate corruption in NetApp talk), will trash both the primary and the snapshot. This leaves you searching for a real backup to recover from.
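    To illustrate the failure domain – a hypothetical toy model, reusing the COW idea sketched in the post above – only the overwritten blocks are materialised in the snapshot, so every other snapshot read still resolves to the primary, and fails with it:

    ```python
    # Toy model (hypothetical): a COW snapshot materialises only the
    # overwritten blocks, so it shares everything else with the primary.

    primary = {0: b"boot", 1: b"data", 2: b"logs"}   # live volume
    snap_area = {}                                   # filled on first overwrite

    def overwrite(block, data):
        snap_area.setdefault(block, primary[block])  # preserve old data once
        primary[block] = data

    def read_snapshot(block):
        return snap_area[block] if block in snap_area else primary[block]

    overwrite(1, b"new-data")
    assert read_snapshot(1) == b"data"   # the preserved block survives

    primary.clear()                      # the RAID group dies...
    try:
        read_snapshot(0)                 # ...taking the "backup" with it
    except KeyError:
        print("unchanged blocks lived on the failed primary: not a backup")
    ```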

    There are still some blurred lines between a recovery point and a backup. Is a full copy (e.g. a clone) on the same array a backup? If you lose the array you also lose the clone, so the answer depends on the client.

    One client I recently worked with didn’t consider a backup complete until it was offsite. They needed to guarantee they could restore from all backups even if they lost their primary site, so even a copy on tape wasn’t considered adequate protection unless it had been duplicated to their DR facility.

  2. There has been a lot of this going around recently:

    1. A snapthing is not a backup! Correct! Nothing is a proper backup until it is geographically removed.
    2. Replication is not a backup! Correct! A single sync or async copy of the state of an array provides only one PIT, which may not even be application-consistent. A second failure could mean unrecoverable loss of data.

    A fully replicated setup of an array containing snapshots is a backup (with probably many PIT states). It does not matter whether the snaps are taken at the primary and replicated, or triggered at both sites simultaneously – as long as they are consistent. Even better, add in synchronously replicated log files for those apps that support them. The replica array should be geographically removed.

    3. If you lose your primary site and start using the replica, you are running a risk of catastrophic data loss if you have a further failure at the second site! Correct!

    So you can include a third replica (on lowest-cost storage), ideally at a third site, or tape backups as well (maybe on a different schedule – i.e. mission-critical data daily, with descending frequency for data that is not so precious). Make sure you include moving the tapes to a protected location.

    All tape backups assume you can copy them back before use, meaning the data is protected. Cool! But how often have we hit bad tapes when doing a recovery? So does that mean we should always write the backup tapes out twice? Cloning is not ideal.

    What we should aim for is two outcomes:

    A. A remote set of data already online for rapid recovery – with PIT depth so that possible corruption can be dealt with; here the logs really help.
    B. Backup copies of data for longer-term archive and selective restore, or other uses such as BI.

    From a design point of view, I can never understand why it is seen as operationally OK to assume that paying for backup software and also copying every block of data to tape every day is the optimum, cost-effective solution.

    Fundamentally, most commentators miss the point that replication is an efficient incremental process which places considerably lower performance and operational demands on the primary site, and this saving can contribute significantly to the secondary-site investment.

    Looking at the investment in buildings, equipment and people, for those businesses which require high uptime and DR capability it really makes most sense to run a dual-datacentre policy: load-split live applications between the two sites and have them act as mutual recovery sites. This way the assets can do useful work at both sites, with headroom provided for full-load DR conditions. A third location can be used as a long-term backup/archive repository, where the performance of supporting runtime systems is not required.

    This will reduce the size of any downtime event (only the applications served live by the affected site are subject to DR), save the cost of a total duplicate infrastructure sitting warm but unused, and allow sensible staff location at both sites to remove the human risk in a disaster scenario.

    Flexibility for manual failover and maintenance is also dramatically improved.
