
High Availability options with Hadoop distributions

With disaster recovery the talk of the town after the recent storm on the US east coast, let’s take a look at what options are available for High Availability (HA) with the various distributions of Hadoop.

Let’s begin by understanding the HA issue in the otherwise highly reliable Apache Hadoop. An extract from Cloudera’s documentation:

“Before … the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

This reduced the total availability of the HDFS cluster in two major ways:
1. In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.
2. Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in periods of cluster downtime.”

Now let us look at various HA options available:

(1) Cloudera

1.1 Quorum-based Storage

“Quorum-based Storage refers to the HA implementation that uses Quorum Journal Manager (QJM).
In order for the Standby node to keep its state synchronized with the Active node in this implementation, both nodes communicate with a group of separate daemons called JournalNodes…In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.”

1.2 Shared Storage Using NFS

“In order for the Standby node to keep its state synchronized with the Active node, this implementation requires that the two nodes both have access to a directory on a shared storage device (for example, an NFS mount from a NAS).
When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node constantly watches this directory for edits, and when edits occur, the Standby node applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.”
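The NFS variant differs mainly in where the shared edits directory points. A minimal sketch, assuming a NAS export mounted at /mnt/nn-shared (a placeholder path) on both NameNode machines:

```xml
<!-- Sketch of the NFS-based shared-edits variant (hdfs-site.xml). -->
<!-- /mnt/nn-shared is an assumed NFS mount visible to both NameNodes. -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/nn-shared</value>
</property>
```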

More details are available in the Cloudera High Availability Guide.

(2) HortonWorks

2.1 HortonWorks and VMware HA

“… jointly developed Hortonworks Data Platform High Availability (HA) Kit for VMware vSphere customers that enables full stack high availability for Hadoop 1.0 by eliminating the NameNode and JobTracker single points of failure. It is a flexible virtual machine-based high availability solution that integrates with the VMware vSphere™ platform’s HA functionality to monitor and automate failover for NameNode and JobTracker master services running within the Hortonworks Data Platform (HDP).”


2.2 Linux HA

An excerpt from the Hortonworks FAQ gives the gist of the Linux HA approach:

“If this was so easy to do with Linux HA or other tools, why didn’t the HDFS community do this earlier?

This is partly because the original HDFS team focused on very large clusters where cold failover was not practical. We assumed that Hadoop needed to provide its own built-in solution. As we’ve developed this technology, we’ve heard directly from our customers that HA solutions are complex and that they prefer using their existing, well understood, solutions.”

The following presentation offers more insight into HortonWorks’ initiatives for HA and what to expect once Hadoop 2.0 stabilizes.


(3) MapR

MapR’s Lockless Storage Services feature a distributed HA architecture:
• The metadata is distributed across the entire cluster. Every node stores and serves a portion of the metadata.
• Every portion of the metadata is replicated on three different nodes (this number can be increased by the administrator). For example, the metadata corresponding to all the files and directories under /project/advertising would exist on three nodes. The three replicas are consistent at all times except, of course, for a short time after a failure.
• The metadata is persisted to disk, just like the data.

The following illustration shows how metadata is laid out in a MapR cluster (in this case, a small 8-node cluster). Each colored triangle represents a portion of the overall metadata, or in MapR terminology, the metadata of a single volume.
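The replica-placement idea above can be sketched in a few lines of Python. This is a toy model, not MapR’s actual placement algorithm (which also accounts for topology and load); it only illustrates hashing a volume’s metadata shard onto three distinct nodes of an 8-node cluster:

```python
import hashlib

def replica_nodes(volume, nodes, replicas=3):
    """Pick `replicas` distinct nodes to hold a volume's metadata shard.

    Toy placement: hash the volume name to a starting node, then take
    the next nodes in ring order. This guarantees three distinct
    replicas as long as the cluster has at least three nodes.
    """
    start = int(hashlib.sha256(volume.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# A small 8-node cluster, as in the illustration
cluster = [f"node{i}" for i in range(1, 9)]

# The metadata for everything under /project/advertising lives on 3 nodes
placement = replica_nodes("/project/advertising", cluster)
print(placement)
```

Losing one of the three nodes leaves two consistent replicas from which a third is re-created, which is why only a short window after a failure is inconsistent.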


(4) IBM

4.1 Redundancy in hardware for master nodes (NameNode, Secondary NameNode, JobTracker)
4.2 Use GPFS-SNC (General Parallel File System – Shared Nothing Cluster)


For the academically oriented, there are a few research papers from IBM Research which offer more insight into the subject:
Feng Wang, Jie Qiu, Jie Yang, Bo Dong, Xinhui Li, and Ying Li. 2009. Hadoop high availability through metadata replication. In Proceedings of the first international workshop on Cloud data management (CloudDB '09). ACM, New York, NY, USA, 37-44.

Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, and Renu Tewari. 2009. Cloud analytics: do we really need to reinvent the storage stack? In Proceedings of the 2009 conference on Hot topics in cloud computing (HotCloud'09). USENIX Association, Berkeley, CA, USA.

Lanyue Lu, Dean Hildebrand, and Renu Tewari. 2011. ZoneFS: Stripe remodeling in cloud data centers. In Proceedings of the 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST '11). IEEE Computer Society, Washington, DC, USA, 1-10.

All copyrights and trademarks lie with their respective owners.


Comments:

    Hi,

    Thanks for posting this summary. I wanted to clarify one thing -- at the top of the post you've outlined two options as being "Cloudera" options. While I'm proud to acknowledge that most of the engineering for these options was done by engineers here at Cloudera, the entirety of the code is upstream in the Apache Software Foundation project. It is entirely open source, and requires no external dependencies to set up other non-Apache software.

    It's worth noting that this is unique among all the options you've described - the others either rely on proprietary non-free software (like MapR) or on complicated deployment (Linux-HA). The Linux-HA and VMware based solutions also only provide cold failover which is inadequate for real-time systems such as HBase.

    Regardless, thanks again for the post.

    -Todd Lipcon
    (Engineer at Cloudera, Apache Hadoop committer)

