
Options for MapReduce with HPC

There is strong interest in having Hadoop and HPC clusters co-exist. The HPC world would like to make better use of its clusters by running MapReduce on them, while the Hadoop world would like to invoke MPI applications from its framework. Let’s look at some of the options that may enable this.

Commercial Enterprise Product :

The Advanced Edition of Platform Symphony includes an Apache Hadoop compatible MapReduce implementation optimized for low latency, reliability and resource sharing. Along with IBM InfoSphere BigInsights (IBM’s Hadoop distribution), Symphony delivers a multi-tenant, heterogeneous application cluster with higher utilization and performance. It supports multi-tenancy through efficient sharing at both the workload and resource layers.

(Image source: IBM presentation)

Research projects :

myHadoop was developed by Sriram Krishnan and Shava Smallen of the San Diego Supercomputer Center (SDSC). It aims to provision Hadoop instances on traditional supercomputing resources on the fly via regular batch scripts. This open source tool has been tested on SDSC Triton, TeraGrid and UC Grid resources.
(Figure: Hadoop on HPC: Main Challenges. Image source: Sriram Krishnan’s presentation)
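
To make the idea of on-the-fly provisioning concrete, here is a minimal Python sketch of the general pattern: the batch scheduler hands the job a list of allocated nodes, and a script turns that allocation into a temporary Hadoop cluster layout. This is not myHadoop's actual code. The function name `plan_hadoop_cluster`, the role names and the node names are all hypothetical, and the sketch assumes the scheduler supplies one hostname per line, possibly repeated.

```python
from typing import Dict, List


def plan_hadoop_cluster(nodefile_text: str) -> Dict[str, List[str]]:
    """Split a batch allocation into one master and the remaining workers.

    `nodefile_text` is assumed to hold one hostname per line, possibly
    repeated (hypothetical input format, for illustration only).
    """
    seen = set()
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        # Keep the first occurrence of each host, preserving order.
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    if not hosts:
        raise ValueError("the allocation contains no nodes")
    master, workers = hosts[0], hosts[1:]
    # On a single-node allocation the master also acts as the only worker.
    return {"master": [master], "workers": workers or [master]}


if __name__ == "__main__":
    # Hypothetical allocation: node01 appears twice, as if it had two slots.
    allocation = "node01\nnode01\nnode02\nnode03\n"
    print(plan_hadoop_cluster(allocation))
```

A batch script built on this pattern would presumably go on to write the chosen hosts into the Hadoop configuration, start the daemons, run the job, and shut everything down before the allocation expires.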

MR job adaptor was developed by Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil. The adaptor aims to:
- Allow transparent MR job submission on HPC clusters
- Minimize the average turnaround time
- Improve overall utilization by exploiting unused resources in the cluster

On the shelf :

  • MR+ (Open MPI) :

MR+ has been one of the most ambitious projects on the subject. The project caught the fancy of the Greenplum team, but it is unclear at this time whether it will be made available for general use. The project claimed to be 10x faster than YARN and aimed to support multiple HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.). It was also claimed that mappers and reducers could be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java.

Hadoop On Demand (HOD) was the original open source project on the subject, but for reasons that remain unclear it was abandoned rather than upgraded or redesigned. HOD could provision virtual Hadoop clusters over a large physical cluster, using the Torque resource manager to allocate nodes. On the allocated nodes, it could start the Hadoop Map/Reduce and HDFS daemons.

In the reckoning :

The platform of choice at Twitter, Airbnb and UC Berkeley, Mesos can run Hadoop, MPI, Hypertable, Spark and other applications. Supercomputing success stories with Mesos, though, are still awaited from the wider user base.
Mesos can be used to:

  • Run multiple instances of Hadoop on the same cluster to isolate production and experimental jobs, or even multiple versions of Hadoop.
  • Run long-lived services (e.g. Hypertable and HBase) on the same nodes as batch applications and share resources between them.
  • Build new cluster computing frameworks without reinventing low-level facilities for farming out tasks, and have them coexist with existing ones.

While YARN is the future of open source MapReduce, its integration with HPC is still unclear. Project Hamster was touted as an MPI plug-in for YARN, but it instead made the MR+ journey to Greenplum, as described above. There has also been a very recent contribution of MPI on YARN using MPICH2. We will wait and watch this space.

