Data deduplication tactics with HDFS and MapReduce


As the amount of data continues to grow exponentially, there has been increased focus on stored-data reduction methods. Data compression, single-instance storage and data deduplication are among the common techniques employed for stored-data reduction.

Deduplication often refers to the elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, deduplication does not change the data itself; it simply avoids storing additional copies of identical data. Data deduplication offers significant reductions in storage and network bandwidth, and promises increased scalability.

From a simple use case perspective, the technique can be applied to remove duplicate Call Detail Records (CDRs) for a telecom carrier. Similarly, it can be used to reduce network traffic that carries the same data packets repeatedly.

Some of the common methods for data deduplication in storage architectures include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and HDFS can be leveraged to eliminate duplicate data. (Some of the approaches listed below include experimental work by students, so they are best termed tactics.)

Tactic 1: Using HDFS and MapReduce only


Owen O’Malley has suggested the following approach in one of the forums:
Keep your historic data sorted by MD5.

Run a MapReduce job to sort your new data into MD5 order. Note that you want a total order, but because the MD5s are evenly spaced across the key space this is easy. Basically, you pick a number of reduces (e.g. 256) and then use the top N bits of the MD5 to pick your reduce. Since this job is only processing your new data, it is very fast.

Next you do a map-side join where each input split consists of an MD5 range. The RecordReader reads from the historic and new datasets, merging them as they go. (You can use the map-side join library for this.) Your map does the merge of the new and old. This is a map-only job, so it also is very fast.

Of course, if the new data is small enough, you can read all of the new input in each of the maps and just keep (and sort in RAM) the new records that are in the right range and do the merge from RAM. This lets you avoid the step where you sort the new data. This kind of merge optimization is where Pig and Hive hide lots of the details from the developer.



Tactic 2: Using HDFS and HBase


In the paper “A novel approach to data deduplication over the engineering-oriented cloud systems”, Zhe Sun, Jun Shen and Jianming Young suggest a method involving HDFS and HBase, which includes:
- Use MD5 and SHA-1 hash functions to calculate the file's hash value and then pass the value to HBase.
- Compare the new hash value with the existing values. If it already exists in the HBase deduplication table, HDFS checks the number of links; if the number is not zero, the counter is incremented by one. If the number is zero, or the hash value did not exist earlier, HDFS asks the client to upload the file and updates the logical path (see the sketch after this list).
- HDFS stores the source files, which are uploaded by users, and the corresponding link files, which are automatically generated. Link files record the source file's hash value and the logical path of the source file.
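
A minimal client-side sketch of that check-and-count step might look like the following, assuming a hypothetical HBase table named "dedup" keyed by the file hash, with a "meta:links" counter column; the table layout and class names are illustrative, not taken from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical check against a "dedup" table keyed by the file hash.
public class DedupCheck {
    private static final byte[] CF = Bytes.toBytes("meta");
    private static final byte[] LINKS = Bytes.toBytes("links");

    // Returns true if the file is a duplicate (only the link counter is bumped),
    // false if the client should upload the file.
    public static boolean isDuplicate(String fileHash) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("dedup"))) {
            byte[] row = Bytes.toBytes(fileHash);
            Result result = table.get(new Get(row));
            byte[] count = result.getValue(CF, LINKS);
            if (count != null && Bytes.toLong(count) > 0) {
                // Known hash with live links: increment the reference counter, skip upload.
                table.incrementColumnValue(row, CF, LINKS, 1L);
                return true;
            }
            // Unknown hash (or zero links): the client uploads the file and the
            // link counter is initialized to 1.
            table.put(new Put(row).addColumn(CF, LINKS, Bytes.toBytes(1L)));
            return false;
        }
    }
}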

Some of the key items to note in this approach include:
- File-level deduplication is used to keep the index as small as possible in order to achieve high lookup efficiency.
- The MD5 and SHA-1 values are merged together to avoid accidental collisions.
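
The second point can be sketched as a composite fingerprint that simply concatenates the two digests; the helper below is a hypothetical illustration, not code from the paper.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

// Builds a composite fingerprint by concatenating the MD5 and SHA-1 digests of a file,
// so that an accidental collision in one hash alone does not produce a false match.
public class CompositeFingerprint {

    public static String of(Path file) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file);
             DigestInputStream withMd5 = new DigestInputStream(in, md5);
             DigestInputStream withBoth = new DigestInputStream(withMd5, sha1)) {
            byte[] buf = new byte[8192];
            while (withBoth.read(buf) != -1) {
                // Reading through the chained streams updates both digests.
            }
        }
        return toHex(md5.digest()) + toHex(sha1.digest());
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}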


Tactic 3: Using HDFS, MapReduce and a Storage Controller


In the paper “Distributed Duplicate Detection in Post-Process Data De-duplication” by NetApp engineers Ashish Kathpal and Gaurav Makkar, and by Mathew John, the authors propose replacing NetApp's duplicate detection stage with a duplicate detection mechanism that uses Hadoop MapReduce. The proposed Hadoop-based duplicate detection workflow includes the following stages:
- Move the fingerprints from the storage controller to HDFS.
- Generate the fingerprint database and store it persistently on the Hadoop Distributed File System (HDFS).
- Generate the duplicate records from the fingerprints using MapReduce and send them back to the storage controller.

Fingerprints are computed hash indexes of data chunks in the storage system. Since fingerprints are typically much smaller than the data chunks they represent, duplicate detection requires far less data movement across the network.
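
The MapReduce step of this workflow can be sketched as a job that re-keys records by fingerprint and emits only fingerprints shared by more than one chunk. The sketch below assumes a simple tab-separated "fingerprint<TAB>chunkId" input layout; the format and class names are assumptions for illustration, not details from the paper.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumes each input line is "<fingerprint>\t<chunkId>"; the layout is illustrative.
public class DuplicateDetection {

    // Re-keys every record by its fingerprint so identical fingerprints meet in one reducer.
    public static class FingerprintMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }

    // Emits a record only when a fingerprint is shared by two or more chunks,
    // i.e. the duplicate list that would be sent back to the storage controller.
    public static class DuplicateReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text fingerprint, Iterable<Text> chunkIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder chunks = new StringBuilder();
            int count = 0;
            for (Text chunkId : chunkIds) {
                if (count++ > 0) chunks.append(',');
                chunks.append(chunkId.toString());
            }
            if (count > 1) {
                context.write(fingerprint, new Text(chunks.toString()));
            }
        }
    }
}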


Tactic 4: Using Streaming, HDFS and MapReduce


There are two possible scenarios for integrating Hadoop with a streaming application. As demonstrated by the IBM InfoSphere Streams and BigInsights integration, these scenarios could be:
(a) Streams to Hadoop flow: Using a control flow that utilizes the Hadoop MapReduce model as part of stream analysis, the operators on Streams observe and deduplicate the incoming data to update and validate the MapReduce model.
Since it is most efficient to remove duplicate records as the data is being ingested, records are deduplicated over a particular time range, count or defined delta in InfoSphere Streams. The deduplicated data is then written to Hadoop BigInsights for building a fresh model.
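
Conceptually, the in-stream deduplication step keeps a bounded window of recently seen record keys and drops repeats within that window. The following framework-agnostic sketch illustrates the idea with a count-based window; it is not InfoSphere Streams code, and the class name is hypothetical.

import java.util.LinkedHashMap;
import java.util.Map;

// Drops records whose key was already seen within the last `windowSize` records.
// A real deployment would key this by record hash and could bound the window by time instead.
public class WindowedDeduplicator {
    private final Map<String, Boolean> recentKeys;

    public WindowedDeduplicator(final int windowSize) {
        // Access-ordered LinkedHashMap evicts the least recently seen key beyond the window.
        this.recentKeys = new LinkedHashMap<String, Boolean>(windowSize, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > windowSize;
            }
        };
    }

    // Returns true if the record should be forwarded downstream (first time seen in the window).
    public boolean accept(String recordKey) {
        return recentKeys.put(recordKey, Boolean.TRUE) == null;
    }
}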

(b) Hadoop to Streams flow: In this approach, Hadoop MapReduce is utilized for removing duplicates from historical data, and the model is then updated. The model is integrated as part of the Streams flow, and an operator can be configured to pick up the model mid-stream and start applying it to incoming data.


Tactic 5: Using MapReduce with Blocking techniques


In Dedoop (Deduplication with Hadoop), a prototype tool developed at the University of Leipzig, MapReduce is utilized for entity resolution of large datasets. This tool shows by far the most mature use of MapReduce for data deduplication.

Blocking-based entity matching is used to semantically partition the input data into blocks of similar records, restricting the matching to entities within the same block.
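
The blocking idea can be sketched as a MapReduce job in which the mapper emits a coarse blocking key (here, a name prefix) so that only records sharing that key are compared pairwise in the reducer. This is an illustration of the general technique, not Dedoop's actual implementation; the input layout and similarity test are placeholders.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative blocking: records are grouped by a coarse blocking key (the first
// three characters of the entity name) and only compared within their block.
public class BlockingMatch {

    public static class BlockingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String name = record.toString().split("\t", 2)[0].toLowerCase();
            String blockKey = name.substring(0, Math.min(3, name.length()));
            context.write(new Text(blockKey), record);
        }
    }

    public static class MatchReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text blockKey, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            // Materialize the block, then compare every pair inside it exactly once.
            List<String> block = new ArrayList<>();
            for (Text r : records) block.add(r.toString());
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    if (similar(block.get(i), block.get(j))) {
                        context.write(new Text(block.get(i)), new Text(block.get(j)));
                    }
                }
            }
        }

        // Placeholder similarity test; Dedoop supports configurable similarity measures.
        private boolean similar(String a, String b) {
            return a.equalsIgnoreCase(b);
        }
    }
}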

Entity resolution processing is split into two MapReduce jobs: an analysis job used primarily to count occurrences, and a match job for load balancing and similarity computation. Dedoop also utilizes greedy load balancing, where match tasks are sorted in descending order of their size and each is assigned to the currently least-loaded reduce task.
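
The greedy assignment itself is straightforward to sketch, as shown below; this illustrates the general idea of descending-size, least-loaded assignment and is not Dedoop's implementation.

import java.util.Arrays;

// Greedy load balancing: sort match tasks by descending size, then always assign
// the next task to the reduce task that currently carries the least total work.
public class GreedyAssignment {
    public static int[] assign(long[] taskSizes, int numReducers) {
        Integer[] order = new Integer[taskSizes.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Long.compare(taskSizes[b], taskSizes[a]));

        long[] load = new long[numReducers];
        int[] assignment = new int[taskSizes.length];
        for (int taskIndex : order) {
            int lightest = 0;
            for (int r = 1; r < numReducers; r++) {
                if (load[r] < load[lightest]) lightest = r;
            }
            assignment[taskIndex] = lightest;
            load[lightest] += taskSizes[taskIndex];
        }
        return assignment;
    }
}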

Dedoop also employs effective techniques for avoiding redundant pair comparisons. It claims to enable the MapReduce program to unambiguously determine which reduce task is responsible for any given pair comparison, eliminating the need for the same comparison to happen on multiple nodes.