Comprestimator Guesstimator

Hey folks, just a quick post based on recent experience with IBM’s NAS Comprestimator utility for Storwize V7000 Unified, where it completely failed to predict an outcome that I had predicted with 100% accuracy using nothing more than common sense. The lesson here is that you should read the NAS Comprestimator documentation very carefully before you trust it (and once you’ve read and understood it, you’ll realize there are some situations in which you simply cannot trust it).

We all know that Comprestimator is a sampling tool, right? It looks at your actual data and works out the compression ratio you’re likely to get… well, kind of…

Let’s look first at the latest IBM spiel at https://www-304.ibm.com/webapp/set2/sas/f/comprestimator/home.html

“The Comprestimator utility uses advanced mathematical and statistical algorithms to perform the sampling and analysis process in a very short and efficient way.”

Cool, advanced mathematical and statistical algorithms – sounds great!
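
For the block flavour, that claim is fair enough – sampling-based estimation is conceptually simple: read a spread of blocks off the device, compress them, and extrapolate. Here’s a minimal sketch of the idea using zlib as a stand-in compressor. This is emphatically not IBM’s code or algorithm, and the block size and sample count are numbers I made up (compile with -lz):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Illustrative only: estimate compressibility by compressing a handful
   of sampled blocks. Not IBM's algorithm. */
#define BLOCK_SIZE  (64 * 1024)   /* assumed sample block size */
#define NUM_SAMPLES 64            /* assumed number of samples */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    /* Size of the input, so sample offsets can be spread across it
       (long/ftell is a simplification that limits very large inputs). */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    if (size < BLOCK_SIZE) { fprintf(stderr, "input too small\n"); fclose(f); return 1; }

    unsigned char in[BLOCK_SIZE];
    uLong bound = compressBound(BLOCK_SIZE);
    unsigned char *out = malloc(bound);
    if (!out) { fclose(f); return 1; }

    double total_in = 0, total_out = 0;
    srand(42);                        /* fixed seed: repeatable sample */

    for (int i = 0; i < NUM_SAMPLES; i++) {
        long offset = (long)(((double)rand() / RAND_MAX) * (size - BLOCK_SIZE));
        fseek(f, offset, SEEK_SET);
        size_t n = fread(in, 1, BLOCK_SIZE, f);
        if (n == 0) continue;

        uLongf out_len = bound;
        if (compress2(out, &out_len, in, n, Z_DEFAULT_COMPRESSION) == Z_OK) {
            total_in  += (double)n;
            total_out += (double)out_len;
        }
    }

    if (total_in > 0)
        printf("estimated compression savings: %.1f%%\n",
               100.0 * (1.0 - total_out / total_in));

    free(out);
    fclose(f);
    return 0;
}

The point is simply that an approach like this looks at the bytes you actually have, which is what you’d assume any “comprestimator” does.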

But there’s a slightly different story told on an older page, one that is somewhat more revealing: http://www14.software.ibm.com/webapp/set2/sas/f/comprestimator/NAS_Compression_estimation_utility.html

“The NAS Compression Estimation Utility performs a very efficient and quick listing of file directories. The utility analyzes file-type distribution information in the scanned directories, and uses a pre-defined list of expected compression rates per filename extension. After completing the directory listing step the utility generates a spreadsheet report showing estimated compression savings per each file-type scanned and the total savings expected in the environment.

It is important to understand that this utility provides a rough estimation based on typical compression rates achieved for the file-types scanned in other customer and lab environments. Since data contained in files is diverse and is different between users and applications storing the data, actual compression achieved will vary between environments. This utility provides a rough estimation of expected compression savings rather than an accurate prediction.”

The difference here is that one is for NAS and one is for block, but I’m assuming that the underlying tool is the same. So, what if you have a whole lot of files with no extension? Apparently Comprestimator then just assumes 50% compression.
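
So, in effect, the NAS flavour reduces to a lookup table keyed on filename extension, with a flat fallback for anything it doesn’t recognise. Here’s a minimal sketch of that idea – the extensions and ratios are my own illustrative assumptions, not IBM’s actual table; the 50% fallback for extension-less files is the behaviour described above:

#include <stdio.h>
#include <string.h>

/* Hypothetical per-extension table of expected compression savings.
   The extensions and ratios are invented for illustration only. */
struct ext_rate { const char *ext; double expected_savings; };

static const struct ext_rate table[] = {
    { "txt",  0.70 },   /* plain text tends to compress well  */
    { "docx", 0.15 },   /* already a zip container            */
    { "jpg",  0.05 },   /* already compressed                 */
    { "db",   0.60 },   /* database files often compress well */
};

/* No extension, or an unknown one: fall back to a flat 50%. */
static const double default_savings = 0.50;

static double estimate_savings(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    if (dot == NULL || dot[1] == '\0')
        return default_savings;                 /* no extension at all */

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(dot + 1, table[i].ext) == 0)
            return table[i].expected_savings;

    return default_savings;                     /* unrecognised extension */
}

int main(void)
{
    const char *samples[] = { "report.txt", "photo.jpg", "LOGFILE" };

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("%-12s -> estimated savings %.0f%%\n",
               samples[i], 100.0 * estimate_savings(samples[i]));

    return 0;
}

Note that nothing in there ever opens a file or looks at a single byte of data – which is exactly why it can be wildly wrong for a data set full of extension-less files.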

Below I reveal the reverse-engineered source code for the NAS Comprestimator when it comes to assessing files with no extension, and I release it under an Apache licence. Live Free or Die, people.

#include <stdio.h>

int main(void)
{
    /* No filename extension? The "advanced" algorithm says 50%. */
    printf("IBM advanced mathematical and statistical algorithms predict the following compression ratio: 50%%\n");
    return 0;
}

enjoy :)



One Response

  1. Jim,
    For the record, I work for IBM and what follows is my own opinion.

    The key take-away from your post is: Read the notes for the tools you download to understand what they deliver.

    Comprestimator on block devices has been so useful and reliable that it has been integrated into the Spectrum Virtualize line GUI and the Accelerate line GUI as a feature to help administrators precisely estimate efficiency before they apply compression – neat (which I assume you knew).

    I think it is fair for readers of your post to be aware of certain facts (since everything is mixed up in the same basket):

    Fact #1: Comprestimator for block devices is based on mathematical algorithms, as stated at the tool’s URL: http://www14.software.ibm.com/webapp/set2/sas/f/comprestimator/home.html. It assesses the “compressibility” of block devices by sampling the actual content of the raw disk, with a low error margin.

    Fact #2: The way Comprestimator for NAS operates is not hidden – we do not have to dig out an “old page”, read any fine print flagged by a hidden asterisk, or reverse engineer anything. It is out in the open, and the link you refer to is not a secret – it is the main page for the tool, and the first two sentences of the page are quite clear: “The utility analyzes file-type distribution information in the scanned directories, and uses a pre-defined list of expected compression rates per filename extension.”

    Fact #3: A general-purpose workload will generate, on average, a 2:1 compression ratio.

    Instead of spitting out crazy efficiency ratios, I have always enjoyed being able to provide clients with a constructed, analytical assessment. When this is not possible, two questions should be asked:
    Question A – what is the data type? If known, how much savings would it typically yield?
    Question B – is the data already compressed, or known to be non-compressible?

    You can either manually run question A past every single file – or you can have a tool do it for you. In cases where this is not possible AND the answer to question B is no, we can base the assessment on Fact #3.

    Now, there is only so much any tool or system or machine can achieve. And while you seem to have considered common sense in this recent experience, it is not clear how far you went in applying it.

    Could that be the second key take-away of your post?

