Skip to main content

Options for MapReduce with HPC

There has been a strong motivation and desire for co-existence of Hadoop and HPC clusters. While it is true that the HPC world would like to optimize their clusters and leverage MapReduce, it is also true that the Hadoop world would like to invoke MPI apps from their framework. Let’s look at some of the options which may enable this.

Commercial Enterprise Product :

The Advanced Edition of Platform Symphony includes an Apache Hadoop compatible MapReduce implementation optimized for low latency, reliability and resource sharing. Along with IBM Infosphere BigInsights (IBM’s hadoop distribution), Symphony delivers a multi-tenant, heterogeneous application cluster with higher utilization and performance. It ensures efficient sharing to ensure multi-tenancy at both workload and resource layer.

 ( Image source: IBM presentation - link)

Research projects :

myHadoop was developed by Sriram Krishnan and Shava Smallen from San Diego Supercomputer Center (SDSC). It aims to provision Hadoop instances on traditional supercomputing resources on the fly via regular batch scripts. This open source tool has been tested on SDSC Triton, TeraGrid and UC Grid resources.
Hadoop on HPC: Main Challenges
 (Image source – Sriram Krishnan’s presentation –link)

MR job adaptor was developed by Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil. The adaptor aims to:
- Allows transparent MR job submission on HPC clusters
– Minimizes the average turnaround time
– Improve the overall utilization, by exploiting unused resources in the cluster

On the shelf :

  • MR+ (Open MPI) :

MR+ has been one of the most ambitious projects on the subject. The project caught fancy of Greenplum team but it is unclear at this time if it will be available for general use. The project claimed to be 10x faster than YARN and aimed support for multiple HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.). It was also claimed that Mappers and reducers could be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java.

Hadoop On Demand (HOD) was the original open source project on the subject but for some strange reasons was abandoned instead of being upgraded or redesigned. HOD could provision virtual Hadoop clusters over a large physical cluster and used Torque resource manager to do node allocation. On the allocated nodes, it could start Hadoop Map/Reduce and HDFS daemons. 

In the reckoning :

The veritable choice of Twitter, AirBnb and UC Berkley, Mesos is a platform which can run Hadoop, MPI, Hypertable, Spark and other applications. Supercomputing success stories with Mesos though are still awaited from the wider use base.
Mesos can be used to:

  • Run multiple instances of Hadoop on the same cluster to isolate production and experimental jobs, or even multiple versions of Hadoop.
  • Run long-lived services (e.g. Hypertable and HBase) on the same nodes as batch applications and share resources between them.
  • Build new cluster computing frameworks without reinventing low-level facilities for farming out tasks, and have them coexist with existing ones.

While YARN is the future of open source MapReduce, there is still lack of clarity on absolute integration with HPC. While project Hamster was touted as MPI plug-in for YARN, it instead made the MR+ journey to Greenplum as listed above. There has been a very recent contribution of MPI on YARN with MPICH2. We will wait and watch this space.


Popular posts from this blog

Offloading legacy with Hadoop

With most Fortune 500 organizations having invested in mainframes and other workload systems in the past, the rise of Big Data platforms poses newer integration challenges. The data integration and ETL players are finding fresh opportunities to solve business and IT problems within the Hadoop ecosystem.
To understand the context, challenges and opportunities, we asked a few questions to Syncsort CEO Lonne Jaffe. Syncsort provides fast, secure, enterprise-grade software spanning Big Data in Apache Hadoop to Big Iron on mainframes. At Syncsort, Lonne Jaffe is focusing on accelerating the growth of the company's high-performance Big Data offerings, both organically and through acquisition.
From mainframes to Hadoop and other platforms, Syncsort seems to have been evolving itself continuously. Where do you see Syncsort heading further?Lonne Jaffe: Syncsort is extraordinary in its ability to continuously reinvent itself. Today, we’re innovating around Apache Hadoop and other Big Data pla…

Data deduplication tactics with HDFS and MapReduce

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.
Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.
From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.
Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and…

Top Big Data Influencers of 2015

2015 was an exciting year for big data and hadoop ecosystem. We saw hadoop becoming an essential part of data management strategy of almost all major enterprise organizations. There is cut throat competition among IT vendors now to help realize the vision of data hub, data lake and data warehouse with Hadoop and Spark.
As part of its annual assessment of big data and hadoop ecosystem, HadoopSphere publishes a list of top big data influencers each year. The list is derived based on a scientific methodology which involves assessing various parameters in each category of influencers. HadoopSphere Top Big Data Influencers list reflects the people, products, organizations and portals that exercised the most influence on big data and ecosystem in a particular year. The influencers have been listed in the following categories:

AnalystsSocial MediaOnline MediaProductsTechiesCoachThought LeadersClick here to read the methodology used.

Analysts:Doug HenschenIt might have been hard to miss Doug…