Tuesday, July 24, 2012

What Is Data De-Duplication and How Does It Work?

Hello all, 

Data de-duplication means comparing objects (files or blocks), removing all the duplicates, and keeping only the unique objects. The result is a smaller set of data (files or blocks). For example,


If you look at the picture above, de-duplication has removed all duplicated data. The result is a smaller group of data. And, of course, there is a way to reproduce the original data. I’ll explain that.

What are the advantages of de-duplication?
  1. Reduced backup cost: you save a large amount of space when you write de-duplicated data to tape or send backups to a remote site over a WAN or LAN.
  2. Reduced WAN and LAN bandwidth: with de-duplicated data, you use less bandwidth and lower your WAN costs when you send data to a remote site.
  3. Reduced hardware cost: you need fewer tapes and hard disks.
  4. Increased storage efficiency.
De-duplication methods:
  1. File-based comparison and then compression: This older method uses the operating system or an application to compare files by attributes such as name, size, type, and modification date. If all the attributes match, you can remove one of the copies. If you also compress the files, you save even more space. With this method, the de-duplication ratio would be 2:1 or 3:1, which means 50% to about 67% less data. It can be done with a script or with the operating system, and it’s free.
  2. File level hashing: It’s like the file-based method, but more intelligent. File level hashing computes a mathematical hash for each file that is, for practical purposes, unique. It then compares the hash of each new file with the hashes of the files already stored. If the hashes match, the files are the same and the new copy can be removed. This method requires an index table that stores the hashes so they can be looked up quickly. Usually the index is kept in RAM, so hash lookups are very fast and don’t slow down the process.
  3. Block level hashing: It’s the same concept as file level hashing, but it works on blocks of data, so it’s independent of the files themselves and of the OS file system. A block here means the unit in which data is stored on disk, regardless of the type of data. De-duplication computes a hash for each block, compares the hash of every new block being stored against the existing ones, and removes the blocks that are identical.
  4. Sub-block level hashing: It works like block level hashing, but it’s much more efficient. This is the most common method used in enterprises today. It divides, or slices, a block of data into a set of sub-blocks of a specific size. For example, it takes a 64KB block of data and creates 4 segments of 16KB each. Then it creates a hash for each slice, or chunk, of data.

The hashes are stored in an index hash table and then compared with each other. As you can see in the picture, the hashes for chunks 1 and 3 are equal. So it removes the duplicated data in segment 3 and puts the hash in its place as a pointer, so the original data can be restored later.
 Please look at the picture below:

Segment 3 is removed and replaced with the hash as a pointer and now the original file takes less space. The original file can be rebuilt by replacing the hash for segment 3 with the data in segment 1.
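
To make the idea concrete, here is a minimal Python sketch of sub-block de-duplication for the 64KB block with four 16KB segments described above. It is only an illustration under my own assumptions: SHA-256 as the hash, fixed-size chunks, and the names dedupe, rebuild, store, and recipe are all hypothetical and not taken from any particular product.

```python
import hashlib

CHUNK_SIZE = 16 * 1024  # 16KB sub-blocks, matching the example in the text


def dedupe(block, chunk_size=CHUNK_SIZE):
    """Split a block into fixed-size chunks and keep each unique chunk once."""
    store = {}   # hash -> chunk data (plays the role of the index hash table)
    recipe = []  # ordered hashes (pointers) needed to rebuild the block
    for offset in range(0, len(block), chunk_size):
        chunk = block[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()  # assumed hash function
        if digest not in store:
            store[digest] = chunk  # first time we see this data: keep it
        recipe.append(digest)      # duplicates only add a pointer
    return store, recipe


def rebuild(store, recipe):
    """Restore the original block by following the pointers in order."""
    return b"".join(store[digest] for digest in recipe)


# A 64KB block in which segments 1 and 3 hold the same data.
seg_a = b"A" * CHUNK_SIZE
seg_b = b"B" * CHUNK_SIZE
seg_d = b"D" * CHUNK_SIZE
block = seg_a + seg_b + seg_a + seg_d

store, recipe = dedupe(block)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(len(block), stored_bytes)  # 65536 49152
```

Only three of the four segments are stored, and calling rebuild(store, recipe) gives back the original 64KB block. A real system would also need to persist the index and deal with things like hash collisions, which this sketch ignores.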

  5. Delta versioning: I will explain this method in my next blog post since it needs more explanation.
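
Going back to method 2 for a moment, file level hashing can be sketched in a few lines of Python. This is only a rough illustration: the files are simulated as an in-memory dictionary of name to content, SHA-256 is my assumed hash function, and find_duplicates is a hypothetical helper name.

```python
import hashlib


def find_duplicates(files):
    """Map each duplicate file name to the first file with the same content."""
    index = {}       # hash -> name of the first file seen with that hash
    duplicates = {}  # duplicate name -> name of the original it matches
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()  # assumed hash function
        if digest in index:
            duplicates[name] = index[digest]  # same hash: treat as a copy
        else:
            index[digest] = name              # new hash: remember this file
    return duplicates


# Simulated files: two of them have identical content.
files = {
    "report.doc": b"quarterly numbers",
    "copy_of_report.doc": b"quarterly numbers",
    "notes.txt": b"meeting notes",
}

duplicates = find_duplicates(files)
print(duplicates)  # {'copy_of_report.doc': 'report.doc'}
```

The index dictionary here stands in for the in-memory index table mentioned in method 2. Unlike the name/size/date comparison in method 1, two files with different names are still detected as duplicates because only the content is hashed.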

Hope you enjoyed it.

Khosro Taraghi


  1. Phew, quite a bit to process, but thanks for your explanation Khosro. I've heard about data deduplication and seen a few sites that offer their services, but I still want to know more about this new technology. Thanks again for your contribution.

  2. That is so much clearer to me now. I had no idea what de-duplication was. I had heard my cousin and friends talking about it. But I had no clue what they were saying. I was just nodding and going along with it. hahah. Kind of sad.