Detecting DDoS hacking attempts with MapReduce and Hadoop

Distributed Denial of Service (DDoS) attacks are one of the most common hacking attempts aimed at making computing resources unavailable or impairing networks across a geographic region. Hadoop and MapReduce can help analyze such attack patterns in network usage: MapReduce jobs detect packet traffic anomalies, while the scalable Hadoop architecture allows the data to be processed within a reasonable response time.

In their paper, Detecting DDoS Attacks with Hadoop (ACM CoNEXT Student Workshop, 2011), Y. Lee and Y. Lee present MapReduce-based algorithms for packet analysis that leverage Hadoop for parallel processing.

The authors propose two distinct algorithms:
-          Counter based method: This method relies on three key parameters: the time interval, which is the duration during which packets are analyzed; the threshold, which indicates the frequency of requests; and the unbalance ratio, which denotes the anomalous ratio of responses per page requested between a specific client and server.

“The masked timestamp with time interval is used for counting the number of requests from a specific client to the specific URL within the same time duration. The reduce function summarizes the number of URL requests, page requests, and server responses between a client and a server. Finally, the algorithm aggregates values per server.”

When the threshold is crossed and the unbalance ratio is higher than normal, the clients are marked as attackers.
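
The counter based method can be sketched in plain Python to show how the masked timestamp, the reduce step, and the final check fit together. This is only a minimal illustration, not the authors' implementation: the record layout `(timestamp, client, server, kind)`, the function names, and the definition of the unbalance ratio as requests divided by responses are all assumptions made for this example, and URL and page requests are collapsed into a single request counter.

```python
from collections import defaultdict

def map_record(record, interval):
    """Map step: key each packet by (client, server, masked timestamp)."""
    ts, client, server, kind = record      # hypothetical record layout
    window = ts - ts % interval            # mask the timestamp to its interval
    requests = 1 if kind == "request" else 0
    responses = 1 if kind == "response" else 0
    return (client, server, window), (requests, responses)

def reduce_counts(mapped):
    """Reduce step: sum request and response counts per key."""
    totals = defaultdict(lambda: [0, 0])
    for key, (req, resp) in mapped:
        totals[key][0] += req
        totals[key][1] += resp
    return totals

def find_attackers(records, interval, threshold, unbalance_ratio):
    """Aggregate per server and flag clients that exceed both limits."""
    mapped = (map_record(r, interval) for r in records)
    attackers = defaultdict(set)           # server -> set of flagged clients
    for (client, server, _), (req, resp) in reduce_counts(mapped).items():
        ratio = req / max(resp, 1)         # assumed definition of the ratio
        if req > threshold and ratio > unbalance_ratio:
            attackers[server].add(client)
    return dict(attackers)

# One noisy client (50 requests, 1 response) and one balanced client.
records = [(t, "10.0.0.9", "web1", "request") for t in range(50)]
records += [(0, "10.0.0.9", "web1", "response")]
records += [(t, "10.0.0.5", "web1", kind)
            for t in range(5) for kind in ("request", "response")]

result = find_attackers(records, interval=60, threshold=20, unbalance_ratio=5.0)
print(result)  # {'web1': {'10.0.0.9'}}
```

On a real cluster the map and reduce functions would run as a Hadoop job over packet traces; the in-memory dictionary here only stands in for the shuffle and reduce phases.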

The key advantage of this algorithm is its low complexity. The authors also indicate that determining the threshold value could be a key deciding factor in the implementation, but they do not offer any further information on how to determine it. Based on our knowledge of other Hadoop implementations, the same packet traffic data could be a rich mine for extracting the threshold value and the unbalance ratio. We have seen in other implementations how Hadoop can be used effectively to analyze logs and arrive at statistical trends and patterns.
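
One possible way to mine a threshold from historical traffic is sketched below. The paper does not prescribe this; the rule used here (mean plus `k` standard deviations of per-window request counts taken from traffic believed to be normal) is a common heuristic chosen purely for illustration, and the function name and sample numbers are hypothetical.

```python
import statistics

def estimate_threshold(window_counts, k=3.0):
    """Return mean + k * population standard deviation of per-window counts."""
    mean = statistics.mean(window_counts)
    spread = statistics.pstdev(window_counts)
    return mean + k * spread

# Hypothetical request counts per time interval from normal traffic.
baseline = [10, 12, 9, 11, 13, 10, 12, 11]
threshold = estimate_threshold(baseline)

# Counts observed later; only values above the threshold are kept.
suspicious = [count for count in [11, 14, 60] if count > threshold]
print(suspicious)  # [60]
```

A percentile of the baseline counts would work in the same place, and the same idea could be applied to the unbalance ratio.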

-          Access pattern based method: Talking of patterns, the authors move on to the next algorithm for detecting an attack. Here they rely on a pattern that differentiates normal traffic from DDoS traffic.
“This method requires more than two MapReduce jobs:
the first job obtains access sequence to the web page between a client and a web server and calculates the spending time and the bytes count for each request of the URL;
the second job hunts out infected hosts by comparing the access sequence and the spending time among clients trying to access the same server.”

What this essentially implies is that if two clients are infected with the same DDoS bot, they are likely to follow the same access sequence (access resource A → B → C → … Z) and very likely to spend the same amount of time and exchange the same amount of data while accessing A, B, or Z. Such uniformity is suspicious because it points to bot behavior rather than normal human interaction. Remember, the analysis here is on HTTP GET requests, which are made more for human interaction than for bot interaction.
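
The two jobs described above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the authors' code: the record layout `(timestamp, client, server, url, seconds, nbytes)` assumes the time spent and byte count per request are already available, and the `time_bucket` rounding and `min_clients` cutoff are hypothetical parameters introduced for this example.

```python
from collections import defaultdict

def job1_access_sequences(records, time_bucket=0.1):
    """Job 1: build each client's ordered access signature per server."""
    grouped = defaultdict(list)
    for ts, client, server, url, seconds, nbytes in records:
        grouped[(client, server)].append((ts, url, seconds, nbytes))
    signatures = {}
    for key, hits in grouped.items():
        hits.sort()  # order requests by timestamp
        # Bucket the time spent so near-identical timings compare as equal.
        signatures[key] = tuple(
            (url, round(seconds / time_bucket), nbytes)
            for _, url, seconds, nbytes in hits
        )
    return signatures

def job2_find_suspects(signatures, min_clients=2):
    """Job 2: per server, flag clients that share an identical signature."""
    by_pattern = defaultdict(set)
    for (client, server), signature in signatures.items():
        by_pattern[(server, signature)].add(client)
    suspects = defaultdict(set)
    for (server, _), clients in by_pattern.items():
        if len(clients) >= min_clients:
            suspects[server] |= clients
    return dict(suspects)

# Two clients replay the same sequence with the same timing and sizes;
# a third client browses differently.
records = [
    (0, "10.0.0.1", "web1", "/a", 0.2, 512),
    (1, "10.0.0.1", "web1", "/b", 0.2, 512),
    (2, "10.0.0.1", "web1", "/c", 0.2, 512),
    (5, "10.0.0.2", "web1", "/a", 0.2, 512),
    (6, "10.0.0.2", "web1", "/b", 0.2, 512),
    (7, "10.0.0.2", "web1", "/c", 0.2, 512),
    (3, "10.0.0.3", "web1", "/a", 1.7, 512),
    (9, "10.0.0.3", "web1", "/c", 3.4, 2048),
]

suspects = job2_find_suspects(job1_access_sequences(records))
print(suspects)  # {'web1': {'10.0.0.1', '10.0.0.2'}}
```

Exact matching is the simplest possible comparison. A practical version would probably ignore very short sequences, since many legitimate clients fetch the same single page, and would tolerate small differences between signatures.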
Obviously, such computation is expensive, and systems today have some way to go before they can do this kind of big data analysis in a reasonable time. The other challenge with both methods is that Hadoop is still largely oriented towards batch processing. Many systems are trying to bring near real-time processing to Hadoop. If you have a story to share about real-time Hadoop processing, drop a comment here or in the More section of this site.

The bigger challenge that the authors have tried to address here is scalability. By leveraging Hadoop's inherent master-slave node architecture for parallel data processing, they have cut down on processing time while being able to deal with ever larger volumes of incoming data.


  1. Last year I worked on designing and developing a real-time abnormal network traffic system for billing and for preventing abnormal usage of networking resources at the Cloud center, Korea Telecom. Apache Hama was used. I can't share details, but here are a few slides.

    1. Thank you Edward for sharing your implementation experience. Your slide deck on slideshare points to an interesting analytics methodology. We would love to see more details.

  2. Respected author,

    I am a College student doing a project on implementation of this paper. I am yet unclear about the dataset for input. Please Help. How can I contact you regarding this over mail?

  3. Thanks for your interest. You may refer to the contact details mentioned in the paper.