
Making Hadoop applications work in the Cloud: Five key guidelines

Enterprises want the cost efficiency and universal access of the cloud for their applications, including Big Data-scale applications that were formerly presumed to be feasible only on on-premises resources. This includes the poster child of Big Data applications, Hadoop.

At the same time, those considering such a move wonder how it will all come together so that their trusted applications run as reliably in the cloud as they do on-premises. In many instances, Hadoop workloads require hundreds to thousands of servers, making them a prime candidate for the cost efficiency that the cloud can bring. However, any application at such a large scale becomes harder to stand up, manage and maintain. Keeping every node in the cluster set to the same configuration and dependencies, and rolling out changes to the entire set of servers in a simple and consistent manner, is a huge headache for IT management and for their users.

I have had the privilege of running application and infrastructure operations for several major consumer packaged goods companies on behalf of two major service providers, and of developing a large-scale public cloud service, and I am now assisting other firms in launching Big Data applications, including Hadoop, in the cloud. With this background, I can attest that standing up massive applications in the cloud, and running them effectively, requires both the right architectural choices and a keen eye for lower-level application dependencies. It also requires a plan for management automation: no staff has unlimited time, and getting bogged down in low-level tasks keeps IT from focusing on its other essential functions.

Cloud Solutions for Big Data

Given how much the infrastructure side of the cloud market has matured, we’re now looking for similar levels of automation and refinement on the application side. While there are others, I see three camps of solutions that have emerged for bringing apps into the cloud and that can be applied to Big Data:

1) Server templates (or machine images), the earliest solutions, provide pre-defined and tested blueprints of common applications, including the operating system, software packages, and scripts required to deploy and configure apps in the cloud. They offer a quick way to deploy simple apps such as web servers, simple databases and developer stacks. Additional templates can be offered for Hadoop elements, such as the Master and Slave nodes in a Hadoop cluster, to ease deployment and reduce the usual trial and error.

2) Platform-as-a-Service (PaaS) offerings, such as Google App Engine and Microsoft Azure, are typically built on top of infrastructure (IaaS) cloud services. They offer a pre-defined set of software stacks that are tested and supported within the boundaries of the service itself. In many cases, these are sets of well-known components, isolated to run within a single language and runtime environment, and on the service provider’s own cloud. A “Hadoop Stack” might be introduced as pre-packaged BDaaS (Big Data as a Service).

3) Data model-driven cloud application platforms, a newer approach, are agnostic to the underlying cloud layer. These platforms sit on top of cloud orchestration platforms such as OpenStack or CloudStack, capture and assemble the multiple components involved in distributed applications such as Hadoop, and manage them holistically as a single management container in the target cloud. They use an underlying application repository and workflow engine to automate many of the common deployment steps, such as provisioning infrastructure (through the underlying cloud API), installing software packages on each instance, and setting configuration parameters and dependencies between different elements. Most importantly, this approach makes it possible to automate the ongoing management of Big Data applications over time: changing application settings, scaling servers up or out, and even moving workloads from public clouds to private ones, or vice versa, with all relevant dependencies captured.
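To make the third approach more concrete, here is a minimal sketch of what "capturing the application as a data model" could look like. It is not any vendor's actual implementation: the `Role` and `Workload` classes, the package names, and the step strings are all hypothetical, and a real platform would call the cloud API rather than return text. The point is only that one model can be expanded into provisioning, installation and configuration steps, and that a scale-out becomes a small change to the model.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """One tier of the application, e.g. Hadoop master or slave nodes (hypothetical model)."""
    name: str
    count: int
    packages: list[str]
    settings: dict[str, str] = field(default_factory=dict)


@dataclass
class Workload:
    """The whole distributed application, managed as a single unit."""
    name: str
    roles: list[Role]


def plan_deployment(workload: Workload) -> list[str]:
    """Expand the model into an ordered list of deployment steps."""
    steps = []
    for role in workload.roles:
        for i in range(1, role.count + 1):
            node = f"{workload.name}-{role.name}-{i}"
            steps.append(f"provision {node}")
            for pkg in role.packages:
                steps.append(f"install {pkg} on {node}")
            for key, value in sorted(role.settings.items()):
                steps.append(f"set {key}={value} on {node}")
    return steps


def plan_scale_out(workload: Workload, role_name: str, new_count: int) -> list[str]:
    """Update one role's node count and return only the additional steps needed."""
    before = set(plan_deployment(workload))
    for role in workload.roles:
        if role.name == role_name:
            role.count = new_count
    return [step for step in plan_deployment(workload) if step not in before]


# Illustrative cluster: package names are placeholders, not real package IDs.
cluster = Workload("hadoop", [
    Role("master", 1, ["namenode-package"]),
    Role("slave", 3, ["datanode-package"], {"dfs.replication": "3"}),
])
initial_steps = plan_deployment(cluster)
extra_steps = plan_scale_out(cluster, "slave", 5)
```

Here `initial_steps` covers the full first deployment, while `extra_steps` contains only the work for the two added slave nodes, which is the kind of incremental change management this approach is meant to automate.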
Cloud Solutions for Hadoop

Each option fits different needs and, as with most things, there are smart ways and harder ways to approach the task of integrating Hadoop-scale applications into the cloud. Here’s a quick summary of what IT managers should do, and should avoid doing, as they approach a Hadoop-in-the-cloud implementation.

Five Key Guidelines

1. Optimized Infrastructure. Evaluate the available Infrastructure (IaaS) cloud services and look at the optimizations they offer to streamline performance and efficiency for apps such as Hadoop. For example, the High-Memory and High-Compute instances that Amazon Web Services offers in EC2, and the SSD storage optimizations offered by service providers such as CloudSigma, have specific value in this area.

2. Dynamic applications. Given the rapidly changing nature of large distributed applications, and of the business rules that influence them, think about managing Hadoop apps holistically as a single entity rather than as a piecemeal collection of servers. Hadoop is a terrific example of a highly distributed, complex application in need of more holistic management capabilities. Holistic management would enable single-click actions on the entire workload of servers (changing a common parameter setting on all the Hadoop slave nodes, for example), as opposed to manually editing a configuration file on hundreds of individual instances.

3. Application Metrics. Understand Hadoop itself, its components, its performance bottlenecks, and which key metrics are of value to the business. How will performance be monitored and measured, and what knobs can be turned to resolve problems when they arise?

4. Scaling. Define business expectations with regard to the required scale of the application, and be ready to start small and grow incrementally. What types of management tasks can be automated, to save on what is surely limited IT time,  and what must be relegated to manual control?  More automation is now possible with the solutions described above.

5. Staff Skill Sets. Team members need experience with cloud infrastructure from a user’s perspective, beyond their application-specific knowledge of Hadoop MapReduce, Hive or other analytics applications. Consultants and service providers can help to some extent, as can finding tools that streamline the process of automated provisioning, change management and application monitoring.
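As a rough illustration of the holistic management described in guideline 2, the sketch below applies one parameter change to every slave node in a cluster and renders the resulting configuration file for each. It assumes the node inventory is a plain dictionary keyed by hostname and that hostnames start with their role name; both are assumptions made for the example, as are the function names. The output follows the `configuration`/`property`/`name`/`value` layout of Hadoop's `*-site.xml` files. Copying the files to the nodes and restarting services is deliberately left out.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties: dict[str, str]) -> str:
    """Render settings in the layout used by Hadoop's *-site.xml files."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def apply_to_role(nodes: dict[str, dict[str, str]],
                  role_prefix: str,
                  change: dict[str, str]) -> dict[str, str]:
    """Apply one change to every node of a role; return the rendered file per node."""
    rendered = {}
    for hostname, props in nodes.items():
        # Assumption for this sketch: the hostname prefix identifies the role.
        if hostname.startswith(role_prefix):
            props.update(change)
            rendered[hostname] = render_site_xml(props)
    return rendered


# Hypothetical inventory: one master and 200 slave nodes.
nodes = {f"slave-{i}": {"dfs.replication": "3"} for i in range(1, 201)}
nodes["master-1"] = {"dfs.replication": "3"}

# One action changes the setting on all 200 slaves and leaves the master alone.
updated = apply_to_role(nodes, "slave-", {"dfs.datanode.handler.count": "10"})
```

The design choice worth noting is that the change is expressed once, against the role, and the per-node files are derived from it. That is what keeps hundreds of instances consistent without anyone editing files by hand.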

With the plethora of scalable, on-demand cloud services available on the market today, IT managers with Hadoop applications now have the opportunity to take advantage of the far lower cost and greater agility enabled when moving very large Hadoop workloads to the cloud.   The good news is that they have a wide range of approaches to expedite the move – and the right one is awaiting their thoughtful choice.
About the author: 
John Yung
John Yung is a former cloud business unit manager at both Savvis and Equinix who ran the cloud-deployed applications for several major consumer packaged goods companies. He is the founder and CEO of Appcara, a provider of solutions for running complex, multi-tier and distributed applications in public and private cloud environments.

