I am starting to see the term ‘data reduction’ cropping up all over the place and being used to mean either compression or deduplication. I have a couple of objections to this.
- It’s not an accurate descriptor
- It’s bad manners to commandeer an existing term from another discipline and redefine it to mean something completely different.
Data reduction is an existing statistical term for summarizing your data: rounding, averaging and so on, dropping some of the detail in order to see the bigger picture or trend. It means reducing a large amount of detailed data to a small amount of summarised data.
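To make the statistical sense concrete: reduction in that world is lossy by design. A tiny sketch with made-up sensor readings (the numbers are purely illustrative):

```python
# Statistical data reduction: detail is deliberately discarded.
readings = [21.3, 21.7, 22.1, 21.9, 35.0, 22.0]  # hypothetical raw values
mean = sum(readings) / len(readings)

# Six detailed values reduced to one summary value. The original
# readings cannot be recovered from the mean, and that information
# loss is the whole point of the exercise.
print(mean)
```

That irreversibility is exactly what you do not want your storage array doing to your data.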
When NetApp or EMC or IBM claim to be reducing your data by using dedup or compression, I really hope they are not actually reducing your data! I hope they are just coming up with ways to store the same data in a more compact format. So data compaction may be a more appropriate term. I suppose data compression could be stretched to cover both dedup and LZW-style compression, but it's probably useful to keep separate terms for those, as they have different implications for performance.
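To illustrate why neither technique "reduces" anything, here is a minimal Python sketch (not any vendor's implementation, just the two ideas in miniature): compression re-encodes each block more compactly, while dedup stores each distinct block once behind a content hash. Both are fully reversible.

```python
import hashlib
import zlib

# Three logical blocks; the first two are identical copies.
data_blocks = [b"hello world" * 100, b"hello world" * 100, b"unique block" * 50]

# Compression: store each block in a more compact lossless encoding.
compressed = [zlib.compress(b) for b in data_blocks]

# Deduplication: keep one physical copy per distinct block, keyed by
# content hash, plus a list of references for the logical layout.
store = {}
refs = []
for b in data_blocks:
    h = hashlib.sha256(b).hexdigest()
    store.setdefault(h, b)
    refs.append(h)

# The original data is fully reconstructible from either representation;
# nothing has been "reduced" in the statistical sense.
restored = [store[h] for h in refs]
assert restored == data_blocks
assert all(zlib.decompress(c) == b for c, b in zip(compressed, data_blocks))
```

Note the different performance characters hinted at here: compression spends CPU on every block written and read, while dedup spends it on hashing and lookups, which is part of why keeping the terms separate is useful.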
As an aside to all of this, methods for storing data in more compact formats are not really the end game; there is a broader issue. For example, if:
- I need a bunch of extra vault drives just to make my system work, or
- my storage file system has a 25% space overhead, or
- my snapshots of tier-1 volumes have to be stored on tier 1, or
- my snapshots are full copies rather than thin copies, or
- I don't support thin provisioning of production volumes in high-performance environments, or
- I need lots of 15K RPM drives to get decent performance when others can do it with SATA or 10K RPM SAS combined with distributed caching or SSDs,

then whether I have dedup or compression is just part of the bigger picture of how efficient my storage is.
So then you need to start talking about overall storage efficiency rather than just compression or dedup, and no discussion of storage efficiency is complete without considering ease of management and costs. Smart technologies can be made inefficient by fine-print limitations or by prohibitive pricing, so a product without dedup or compression might still have a lower TCO, which would effectively make it the more efficient product.
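A back-of-the-envelope sketch of that last point, with entirely made-up numbers: what matters is cost per usable terabyte, not whether a given feature box is ticked.

```python
# Hypothetical comparison (all figures invented for illustration):
# Array A has dedup (say 2:1 on this workload) but a higher list price
# per raw TB; Array B has no dedup but is cheaper per raw TB.
a_price_per_raw_tb, a_dedup_ratio = 1000.0, 2.0
b_price_per_raw_tb, b_dedup_ratio = 450.0, 1.0

a_cost_per_usable_tb = a_price_per_raw_tb / a_dedup_ratio  # 500.0
b_cost_per_usable_tb = b_price_per_raw_tb / b_dedup_ratio  # 450.0

# The product without the "efficiency" feature wins on cost here.
print(a_cost_per_usable_tb, b_cost_per_usable_tb)
```

Fine-print limits on where dedup applies, or management overhead, would shift these numbers further; the arithmetic has to include the whole system, not just the headline ratio.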
So let’s try to avoid getting lost down any technology rat holes whilst searching for the sunlight of storage efficiency. By all means go down them and explore them (I’ll join you) but let’s not get lost down there.
So, in summary of my main point:
- Let’s not be bad mannered and redefine someone else’s existing terminology.
- Let’s try to use accurate terms to describe what we do.
Rant over : )