Skip to main content

Data deduplication tactics with HDFS and MapReduce

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.

Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.

From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.

Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and HDFS can be leveraged for eliminating duplicate data. (The approach listed below includes certain experimental approach by students and so may be best termed as tactics.)

Tactic 1: Using HDFS and MapReduce only

Owen O’Malley has suggested the following approach in one of the forums:
Keep your historic data sorted by md5.

Run a MapReduce job to sort your new data into md5 order. Note that
you want a total order, but because the md5's are evenly spaced across
the key space this is easy. Basically, you pick a number of reduces
(eg. 256) and then use the top N bits of the MD5 to pick your reduce.
Since this job is only processing your new data, it is very fast.

Next you do a map-side join where each input split consists of an md5
range. The RecordReader reads from the historic and new datasets
merging them as they go. (You can use the map-side join library for
this.) Your map does the merge of the new and old. This is a map-only
job, so it also is very fast.

Of course if the new data is small enough you can read all of the new
input in each of the maps and just keep (and sort in ram) the new
records that are in the right range and do the merge from ram. This
lets you avoid the step where you sort the new data. This kind of
merge optimization is where Pig and Hive hide lots of the details from
the developer.

Tactic 2: Using HDFS and HBase

In a paper “A novel approach to data deduplication over the engineering-oriented cloud systems” by Zhe Sun, Jun Shen, Jianming Young suggested a method involving HDFS and HBase, which includes:
-          Use MD5 and SHA-1 hash functions to calculate the file's hash value and then pass the value to HBase.
-          Compare the new hash value with the existing values. If it exists earlier in HBase deduplication table, HDFS will check the number of links, and if the number is not zero, the counter will be incremented by one. If the number is zero or hash value did not exist earlier, HDFS will ask the client to upload the file and update the logical path.
-          HDFS will store source files, which are uploaded by users, and corresponding link files, which are automatically generated. Link files record the source file's hash value and the logical path of the source file.

Some of the key items to note in this approach include:
-          file level deduplication to keep the index as small as possible in order to achieve high lookup efficiency.
-          MD5 and SHA-1 values are merged together to avoid accidental collision

Tactic 3: Using HDFS, MapReduce and a Storage Controller

In a paper “Distributed Duplicate Detection in Post-Process Data De-duplication” by Netapp engineers Ashish Kathpal & Gaurav Makkar  and by Mathew John,  the authors propose to replace the duplicate detection stage of NetApp with duplicate detection mechanism that uses Hadoop MapReduce. The proposed Hadoop based duplicate detection workflow included the following stages:
- Move the fingerprints from the storage controller to HDFS
- Generate fingerprint database and store it persistently on the Hadoop Distributed File System (HDFS).
-Generate the duplicate records from the fingerprints using MapReduce and send it back to the storage controller.

Fingerprints are computed hash indexes of data chunks in the storage system. Since the fingerprints are typically much smaller in size as compared to the data chunk they represent, it involves lesser movement of data across the network for duplicate detection.

Tactic 4: Using Streaming, HDFS and MapReduce

There can be two possible scenarios for Hadoop and Streaming application integration. As demonstrated in IBM Infosphere Streams and BigInsights integration, these scenarios could be:
(a)  Streams to Hadoop flow: Using a control flow that utilizes Hadoop MapReduce model as part of stream analysis, the operators on Stream observe and deduplicate the incoming data to update and validate the MapReduce model.
Since it is recognized that it is most efficient to remove the duplicate records in this data as it is being ingested, records are deduplicated for a particular time range, count or defined delta in InfoSphere Streams. The deduplicated data is then sink(ed) to Hadoop BigInsights for building a fresh model.

(b)  Hadoop to Streams flow: In this approach, Hadoop MapReduce is utilized for removing duplicate data from historical data and then the model is updated. The model in integrated as part of Streams flow and an operator can be configured to pick up the model mid-stream and start applying it to incoming data.

Tactic 5: Using MapReduce with Blocking techniques

In a prototype tool Dedoop (Deduplication with Hadoop) developed by University of Leipzig, MapReduce has been utilized for entity resolution of large datasets. This tool by far shows the most mature use of MapReduce for data deduplication.

Blocking based entity matching is used to semantically partition the input data into blocks of similar records and restricting to entities of the same block.

Entity resolution processing is split into 2 MapReduce jobs - Analysis job used primarily to count occurrences and Match job for load balancing & similarity computation. Also, it utilizes greedy load balancing where match tasks are sorted in descending order by their size and assigned to fewest loaded reduce task.

Dedoop also employs effective techniques for avoiding redundant pair comparisons.  It claims to enable MR program to unambiguously determine which reduce task is responsible for any pair comparison eliminating need for same comparison happening on multiple nodes.


  1. Can I get coding for mapreduce in Java for data reduplication ..


  2. Superb. I really enjoyed very much with this article here. Really it is an amazing article I had ever read. I hope it will help a lot for all. Thank you so much for this amazing posts and please keep update like this excellent article.thank you for sharing such a great blog with us. expecting for your.

    hadoop administration training
    big data administrator training


Post a Comment

Popular articles

5 online tools in data visualization playground

While building up an analytics dashboard, one of the major decision points is regarding the type of charts and graphs that would provide better insight into the data. To avoid a lot of re-work later, it makes sense to try the various chart options during the requirement and design phase. It is probably a well known myth that existing tool options in any product can serve all the user requirements with just minor configuration changes. We all know and realize that code needs to be written to serve each customer’s individual needs. To that effect, here are 5 tools that could empower your technical and business teams to decide on visualization options during the requirement phase. Listed below are online tools for you to add data and use as playground. 1)      Many Eyes : Many Eyes is a data visualization experiment by IBM Research and the IBM Cognos software group. This tool provides option to upload data sets and create visualizations including Scatter Plot, Tree Ma

In-memory data model with Apache Gora

Open source in-memory data model and persistence for big data framework Apache Gora™ version 0.3, was released in May 2013. The 0.3 release offers significant improvements and changes to a number of modules including a number of bug fixes. However, what may be of significant interest to the DynamoDB community will be the addition of a gora-dynamodb datastore for mapping and persisting objects to Amazon's DynamoDB . Additionally the release includes various improvements to the gora-core and gora-cassandra modules as well as a new Web Services API implementation which enables users to extend Gora to any cloud storage platform of their choice. This 2-part post provides commentary on all of the above and a whole lot more, expanding to cover where Gora fits in within the NoSQL and Big Data space, the development challenges and features which have been baked into Gora 0.3 and finally what we have on the road map for the 0.4 development drive. Introducing Apache Gora Although