There is strong interest in letting Hadoop and HPC clusters co-exist. The HPC world would like to leverage MapReduce on its existing clusters, and the Hadoop world would like to invoke MPI applications from its framework. Let’s look at some of the options that may enable this.
The Advanced Edition of Platform Symphony includes an Apache Hadoop compatible MapReduce implementation optimized for low latency, reliability and resource sharing. Along with IBM InfoSphere BigInsights (IBM’s Hadoop distribution), Symphony delivers a multi-tenant, heterogeneous application cluster with higher utilization and performance. It shares resources efficiently to support multi-tenancy at both the workload and resource layers.
(Image source: IBM presentation - link)
Research projects :
myHadoop was developed by Sriram Krishnan and Shava Smallen from San Diego Supercomputer Center (SDSC). It aims to provision Hadoop instances on traditional supercomputing resources on the fly via regular batch scripts. This open source tool has been tested on SDSC Triton, TeraGrid and UC Grid resources.
Hadoop on HPC:
(Image source – Sriram Krishnan’s presentation –link)
MR job adaptor was developed by Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil. The adaptor aims to:
- Allow transparent MR job submission on HPC clusters
- Minimize the average turnaround time
- Improve the overall utilization by exploiting unused resources in the cluster
On the shelf :
(Open MPI) :
MR+ has been one of the most ambitious projects on the subject. The project caught the fancy of the Greenplum team, but it is unclear at this time whether it will be available for general use. The project claimed to be 10x faster than YARN and aimed to support multiple HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.). It was also claimed that mappers and reducers could be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java.
Hadoop On Demand (HOD) was the original open source project on the subject but, for reasons that remain unclear, it was abandoned rather than upgraded or redesigned. HOD could provision virtual Hadoop clusters over a large physical cluster and used the Torque resource manager to do node allocation. On the allocated nodes, it could start the Hadoop Map/Reduce and HDFS daemons.
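To make the idea of provisioning Hadoop on scheduler-allocated nodes more concrete, here is a minimal Python sketch. It is not HOD's actual code. It assumes the scheduler hands the job a node list in the style of Torque's $PBS_NODEFILE (one hostname per allocated slot, so hostnames repeat), and it assumes a Hadoop 1.x style layout where the first node hosts the master daemons and the rest are workers. The function name plan_hadoop_roles and the returned keys are hypothetical, chosen only for illustration.

```python
def plan_hadoop_roles(nodefile_text):
    """Map a scheduler node list to Hadoop 1.x style daemon roles.

    Illustrative sketch only; the function name and the returned keys
    are assumptions, not part of HOD or any other tool.
    """
    # A Torque-style node file repeats a hostname once per allocated slot,
    # so de-duplicate while keeping the scheduler's ordering.
    hosts = list(dict.fromkeys(
        line.strip() for line in nodefile_text.splitlines() if line.strip()
    ))
    if not hosts:
        raise ValueError("no nodes were allocated")

    # Assumption: the first allocated node runs the master daemons.
    master = hosts[0]
    # With a single node, the master doubles as the only worker.
    workers = hosts[1:] or [master]

    return {
        "namenode": master,       # HDFS master
        "jobtracker": master,     # Map/Reduce master
        "datanodes": workers,     # HDFS workers
        "tasktrackers": workers,  # Map/Reduce workers
    }
```

In a real batch job, the input would be the contents of the file that the scheduler's node-list variable points to, and the resulting role assignment would then be written into the Hadoop configuration before the daemons are started on each node.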
In the reckoning :
The veritable choice of Twitter, Airbnb and UC Berkeley, Mesos is a platform that can run Hadoop, MPI, Hypertable, Spark and other applications. Supercomputing success stories with Mesos, though, are still awaited from the wider user base.
While YARN is the future of open source MapReduce, there is still a lack of clarity on full integration with HPC. While project Hamster was touted as an MPI plug-in for YARN, it instead made the MR+ journey to Greenplum as described above. There has been a very recent contribution of MPI on YARN using MPICH2. We will wait and watch this space.