
Detecting DDoS hacking attempts with MapReduce and Hadoop

Distributed Denial of Service (DDoS) attacks are among the most common security hacking attempts, aimed at making computation resources unavailable or impairing geographical networks. To analyze such attack patterns in network usage, Hadoop and MapReduce can step in. While MapReduce detects packet traffic anomalies, the scalable Hadoop architecture offers a way to process the data within a reasonable response time.

In a paper by Y. Lee and Y. Lee, Detecting DDoS Attacks with Hadoop (ACM CoNEXT Student Workshop, 2011), the authors present MapReduce-based algorithms for packet analysis that leverage Hadoop for parallel processing.

Two distinct algorithms have been proposed:
- Counter-based method: This method relies on three key parameters: the time interval, which is the duration during which packets are analyzed; the threshold, which indicates the frequency of requests; and the unbalance ratio, which denotes the anomaly ratio of responses per page requested between a specific client and server.

“The masked timestamp with time interval is used for counting the number of requests from a specific client to the specific URL within the same time duration. The reduce function summarizes the number of URL requests, page requests, and server responses between a client and a server. Finally, the algorithm aggregates values per server.”

When the threshold is crossed and the unbalance ratio is higher than normal, the clients are marked as attackers.
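To make the flow concrete, here is a plain-Python sketch of the counter-based method. This is not the authors' implementation: the record fields, the in-memory grouping that stands in for the shuffle phase, and all three parameter values are assumptions (the paper leaves threshold determination open).

```python
from collections import defaultdict

# Hypothetical parameter values; the paper does not specify how to choose them.
TIME_INTERVAL = 60        # seconds: timestamps are masked into fixed windows
REQUEST_THRESHOLD = 100   # requests per (client, server) pair per window
UNBALANCE_RATIO = 10.0    # URL requests per distinct page considered anomalous

def map_packet(packet):
    # Mask the timestamp with the time interval so all packets in the same
    # window share a key, then key by (window, client, server).
    window = packet["timestamp"] // TIME_INTERVAL
    return (window, packet["client"], packet["server"]), packet["url"]

def reduce_counts(urls):
    # Summarize the number of URL requests and distinct pages for one key.
    total = len(urls)
    pages = len(set(urls))
    return total, total / pages

def detect_attackers(packets):
    # In-memory stand-in for the MapReduce shuffle: group values by key.
    groups = defaultdict(list)
    for p in packets:
        key, url = map_packet(p)
        groups[key].append(url)
    attackers = set()
    for key, urls in groups.items():
        total, ratio = reduce_counts(urls)
        if total > REQUEST_THRESHOLD and ratio > UNBALANCE_RATIO:
            attackers.add(key[1])  # flag the client
    return attackers
```

A client hammering a single URL hundreds of times within one window exceeds both the threshold and the unbalance ratio and is flagged, while a client fetching a handful of distinct pages is not.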

The key advantage of this algorithm is its low complexity. The authors also indicate that determining the threshold value could be a key deciding factor in the implementation, but offer no further guidance on how to determine it. Based on our knowledge of other Hadoop implementations, the same packet traffic data could be a rich mine for extracting the threshold value and unbalance ratio: we have seen elsewhere how Hadoop can be effectively used to analyze logs and arrive at statistical trends and patterns.

- Access-pattern-based method: Speaking of patterns, the authors move on to the next algorithm for detecting an attack. Here they rely on a pattern that differentiates normal traffic from DDoS traffic.
This method requires more than two MapReduce jobs: “the first job obtains access sequence to the web page between a client and a web server and calculates the spending time and the bytes count for each request of the URL; the second job hunts out infected hosts by comparing the access sequence and the spending time among clients trying to access the same server.”

What this essentially implies is that if two clients carry the same DDoS bot, they could be using the same access sequence (accessing resource A → B → C → … Z) and have a very high likelihood of spending the same amount of time and exchanging the same amount of data while accessing A, B, or Z. This indicates suspicious, bot-like behavior rather than normal human interaction. Remember, the analysis here is on HTTP GET requests, which are made more by human interaction than by bots.
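As a rough illustration of the two jobs, here is a plain-Python sketch (not the paper's code): `build_sequences` stands in for the first job and `find_bots` for the second; the record fields, function names, and timing tolerance are all assumptions.

```python
from collections import defaultdict

TIME_TOLERANCE = 0.05  # hypothetical: allowed relative spread in spending time

def build_sequences(requests):
    # Stand-in for the first job: per (client, server), order the requests by
    # time and record the URL access sequence plus the total spending time.
    per_client = defaultdict(list)
    for r in requests:
        per_client[(r["client"], r["server"])].append(r)
    profiles = {}
    for key, rs in per_client.items():
        rs.sort(key=lambda r: r["timestamp"])
        sequence = tuple(r["url"] for r in rs)
        spending_time = rs[-1]["timestamp"] - rs[0]["timestamp"]
        profiles[key] = (sequence, spending_time)
    return profiles

def find_bots(profiles, tolerance=TIME_TOLERANCE):
    # Stand-in for the second job: clients of the same server that share an
    # identical access sequence AND near-identical spending time look like
    # instances of the same bot rather than independent humans.
    by_pattern = defaultdict(list)
    for (client, server), (sequence, spent) in profiles.items():
        by_pattern[(server, sequence)].append((client, spent))
    bots = set()
    for members in by_pattern.values():
        if len(members) < 2:
            continue  # a pattern seen from one client only is not suspicious
        times = [t for _, t in members]
        mean = sum(times) / len(times)
        if all(abs(t - mean) <= tolerance * max(mean, 1e-9) for t in times):
            bots.update(client for client, _ in members)
    return bots
```

Two clients replaying the identical A → B → C sequence with matching timing get flagged together, while a lone human with its own sequence does not.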
Obviously such computation is expensive, and today's systems still have some way to go before they can perform this big data analysis in reasonable time. The other challenge with both methods is that Hadoop remains heavily oriented towards batch processing; many systems are trying to bring Hadoop closer to near-real-time operation. If you have a story to share about real-time Hadoop processing, drop a comment here or on the More section of this site.

The bigger challenge that the authors have tried to address here is scalability. By leveraging Hadoop for parallel data processing through its inherent master-slave node architecture, they have cut down processing time while being able to deal with ever larger volumes of incoming data.


  1. Last year I worked on designing and developing a real-time abnormal network traffic system for billing and preventing abnormal usage of networking resources at the Cloud center, Korea Telecom. Apache Hama was used. I can't share details, but here are a few slides.

    1. Thank you, Edward, for sharing your implementation experience. Your slide deck on SlideShare points to an interesting analytics methodology. We would love to see more details.

  2. Respected author,

    I am a college student doing a project on the implementation of this paper. I am still unclear about the input dataset. Please help. How can I contact you regarding this over mail?

  3. Thanks for your interest. You may refer to the contact coordinates mentioned in the paper.


