A pedant’s rant on “Data Reduction”

I am starting to see the term ‘data reduction’ cropping up all over the place and being used to mean either compression or deduplication. I have a couple of objections to this.

  1. It’s not an accurate descriptor.
  2. It’s bad manners to commandeer an existing term from another discipline and redefine it to mean something completely different.

Data reduction is an existing statistical term for summarizing your data: rounding, averaging, etc., i.e. dropping some of the detail in order to see the bigger picture or trend. It means reducing a large amount of detailed data to a small amount of summarized data.
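In the statistical sense, then, a reduction genuinely throws detail away. A minimal sketch in Python (the readings are made-up numbers, purely for illustration):

```python
# Data reduction in the statistical sense: summarizing detail away.
# Hypothetical daily temperature readings reduced to a single average.
daily_readings = [21.3, 22.1, 19.8, 20.5, 23.0, 22.4, 21.7]

# Reduce: one summary value replaces seven detailed ones.
weekly_average = round(sum(daily_readings) / len(daily_readings), 1)
print(weekly_average)  # → 21.5; the detail is gone, only the trend remains
```

There is no way back from 21.5 to the original seven values, which is exactly what makes it a reduction.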

When NetApp or EMC or IBM claim to be reducing your data by using dedup or compression, I really hope they are not reducing your data! I hope they are just coming up with ways to store the same data in a more compact format. So, data compaction may be a more appropriate term. I guess that data compression could be used to cover both dedup and LZW compression, but it’s probably useful to keep separate terms for those as they have separate implications for performance.
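By contrast, both of the techniques the vendors sell are lossless: every byte of the logical data is recoverable, which is why “compaction” fits better than “reduction”. A toy sketch in Python (illustrative only; real arrays use fixed or content-defined block sizes, not this crude chunking):

```python
import hashlib
import zlib

# Redundant logical data: the first and last thirds are identical.
data = b"block-A" * 64 + b"block-B" * 64 + b"block-A" * 64

# Compression: encode one stream more compactly, losslessly.
compressed = zlib.compress(data)
assert zlib.decompress(compressed) == data  # every byte comes back

# Deduplication: store each distinct chunk once, keep ordered references.
chunks = [data[i:i + 448] for i in range(0, len(data), 448)]
store = {}   # content hash -> unique chunk (physical data)
refs = []    # ordered hashes reconstruct the stream (logical data)
for chunk in chunks:
    digest = hashlib.sha256(chunk).hexdigest()
    store.setdefault(digest, chunk)
    refs.append(digest)

# Only 2 unique chunks are physically stored for 3 logical chunks,
# yet the full logical stream is still perfectly recoverable.
assert b"".join(store[h] for h in refs) == data
```

The performance implications differ, too: compression costs CPU on every read and write, while dedup costs hashing and index lookups, which is one reason it’s useful to keep the terms separate.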

As an aside to all of this, methods for storing data in more compact formats are not really the end game; there is a broader issue. For example: if I need a bunch of extra vault drives just to make my system work; if my storage file system has a 25% space overhead; if my snapshots of tier1 volumes have to be stored on tier1; if my snapshots are full copies rather than thin copies; if I don’t support production volume thin provisioning in high-performance environments; or if I need lots of 15K RPM drives to get decent performance when others can do it with SATA or 10K RPM SAS in combination with distributed caching or SSDs; then whether I have dedup or compression is just part of the bigger picture of how efficient my storage is.

So then you need to start talking about overall storage efficiency rather than just compression or dedup, and no discussion of storage efficiency is complete without considering ease of management and costs. Smart technologies can be made inefficient by fine-print limitations or by prohibitive pricing, so a product without dedup or compression might still have a lower TCO, which would effectively make it the more efficient product.

So let’s try to avoid getting lost down any technology rat holes whilst searching for the sunlight of storage efficiency. By all means go down them and explore them (I’ll join you) but let’s not get lost down there.

So, in summary of my main point:

  • Let’s not be bad-mannered and redefine someone else’s existing terminology.
  • Let’s try to use accurate terms to describe what we do.

Rant over :)


4 Responses

  1. […] Data reduction. Posted on September 22, 2010 by rogerluethy: Good blog from the Storage Buddhist on the topic Data reduction. This pops up again and again (does it mean […]


  2. I agree these terms are used loosely and can be misleading. As far as data reduction, it is true that dedupe and compression will reduce the amount of data stored on disks (think physical data) but all the data is actually still there (think logical data).

    Because of this potential confusion in terms, SNIA has favored the term “capacity optimization”. Many leading storage vendors (including NetApp, my employer) have adopted this term, but old habits are hard to break…

    For more on this, refer to http://www.snia.com/dpco



  3. SB, sorry about the typo and thanks for posting the correct URL.

