
Options for MapReduce with HPC

There is strong interest in having Hadoop and HPC clusters coexist. The HPC world would like to make better use of its clusters by running MapReduce on them, while the Hadoop world would like to invoke MPI applications from its framework. Let’s look at some of the options that may enable this.

Commercial Enterprise Product :

The Advanced Edition of Platform Symphony includes an Apache Hadoop compatible MapReduce implementation optimized for low latency, reliability and resource sharing. Along with IBM InfoSphere BigInsights (IBM’s Hadoop distribution), Symphony delivers a multi-tenant, heterogeneous application cluster with higher utilization and performance. It shares resources efficiently to support multi-tenancy at both the workload and resource layers.

(Image source: IBM presentation)

Research projects :

myHadoop was developed by Sriram Krishnan and Shava Smallen of the San Diego Supercomputer Center (SDSC). It aims to provision Hadoop instances on traditional supercomputing resources on the fly via regular batch scripts. This open source tool has been tested on SDSC Triton, TeraGrid and UC Grid resources.
(Image: "Hadoop on HPC: Main Challenges". Source: Sriram Krishnan’s presentation)
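
To make the idea of provisioning Hadoop from inside a batch job more concrete, here is a minimal sketch in Python. It is not myHadoop's actual code. It assumes a Torque-style node file (the file named by the PBS_NODEFILE environment variable, with one hostname per allocated slot), and the function name `plan_hadoop_cluster` and the returned layout are hypothetical, chosen only for illustration.

```python
def plan_hadoop_cluster(nodefile_text):
    """Turn a batch scheduler's node list into a per-job Hadoop layout.

    nodefile_text: contents of a Torque-style node file, assumed to hold
    one hostname per line, repeated once per allocated slot.
    """
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    if not hosts:
        raise ValueError("node list is empty")

    # Deduplicate while keeping the scheduler's ordering.
    unique = list(dict.fromkeys(hosts))

    # Assumption: the first allocated host runs the master daemons and the
    # remaining hosts run the workers; a single-node job does both.
    master = unique[0]
    workers = unique[1:] or [master]

    # Count how many slots the scheduler granted on each host.
    slots = {host: hosts.count(host) for host in unique}
    return {"master": master, "workers": workers, "slots": slots}


if __name__ == "__main__":
    sample = "n1\nn1\nn2\nn2\nn3\n"
    print(plan_hadoop_cluster(sample))
```

A batch script could call something like this at job start, write the master and worker names into the Hadoop configuration files for that job, start the daemons, run the MapReduce work, and tear everything down before the allocation ends.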

MR job adaptor was developed by Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil. The adaptor aims to:
- Allow transparent MR job submission on HPC clusters
- Minimize the average turnaround time
- Improve overall utilization by exploiting unused resources in the cluster

On the shelf :

  • MR+ (Open MPI) :

MR+ has been one of the most ambitious projects on the subject. The project caught the fancy of the Greenplum team, but it is unclear at this time whether it will be available for general use. The project claimed to be 10x faster than YARN and aimed to support multiple HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.). It was also claimed that mappers and reducers could be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java.

Hadoop On Demand (HOD) was the original open source project on the subject, but for unclear reasons it was abandoned instead of being upgraded or redesigned. HOD could provision virtual Hadoop clusters over a large physical cluster and used the Torque resource manager for node allocation. On the allocated nodes, it could start the Hadoop Map/Reduce and HDFS daemons.

In the reckoning :

The choice of Twitter, Airbnb and UC Berkeley, Mesos is a platform that can run Hadoop, MPI, Hypertable, Spark and other applications. Supercomputing success stories with Mesos, though, are still awaited from the wider user base.
Mesos can be used to:

  • Run multiple instances of Hadoop on the same cluster to isolate production and experimental jobs, or even multiple versions of Hadoop.
  • Run long-lived services (e.g. Hypertable and HBase) on the same nodes as batch applications and share resources between them.
  • Build new cluster computing frameworks without reinventing low-level facilities for farming out tasks, and have them coexist with existing ones.

While YARN is the future of open source MapReduce, its integration with HPC is still unclear. Project Hamster was touted as an MPI plug-in for YARN, but it instead made the MR+ journey to Greenplum as listed above. There has been a very recent contribution of MPI on YARN with MPICH2. We will wait and watch this space.

