
Making Hadoop applications work in the Cloud: Five key guidelines

Enterprises want the cost efficiency and universal access of cloud-based applications, including Big Data-scale applications that were formerly presumed feasible only on on-premises resources. That includes the poster child of Big Data applications: Hadoop.

At the same time, those considering such a move wonder how it will all come together so that their trusted applications run as reliably in the cloud as they do on-premises. In many instances, Hadoop workloads require hundreds to thousands of servers, making them a prime candidate for the cost efficiency the cloud can bring. However, any application at such a large scale becomes harder to stand up, manage and maintain. Keeping each node in the cluster consistent (the same configuration and dependencies on every server) while also managing changes to the entire set of servers in a simple, uniform manner is a huge headache for IT management, and for their users.

I have had the privilege of running application and infrastructure operations for several major consumer packaged goods companies on behalf of two major service providers, of developing a large-scale public cloud service, and now of assisting other firms in launching Big Data applications, including Hadoop, in the cloud. With that background, I can attest that standing up massive applications in the cloud, and running them effectively, requires both the right architectural choices and a keen eye for lower-level application dependencies. It also requires planning for management automation: no staff has unlimited time, and getting bogged down in low-level tasks keeps IT from focusing on its other essential functions.

Cloud Solutions for Big Data

Given the maturity of the infrastructure side of the cloud market, we are now looking for similar levels of automation and refinement on the application side. While there are others, three camps of solutions have emerged for bringing applications into the cloud that can be applied to Big Data:

1) Server templates (or machine images), the earliest solutions, provide pre-defined and tested blueprints of common applications, including the operating system, software packages, and the scripts required for deployment and configuration in the cloud. These offer a quick route to deploying simple applications such as web servers, basic databases and developer stacks. Additional templates can cover Hadoop elements such as the master and slave nodes of a cluster, easing deployment and reducing the typical trial-and-error process.
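To make the template idea concrete, here is a minimal sketch of what a server-template blueprint for Hadoop roles might capture, expressed as plain data. All names (image, packages, services) are hypothetical placeholders, not a real provider's template format:

```python
# Illustrative server templates for two Hadoop cluster roles.
# Each blueprint captures the OS image, packages, services and
# post-install steps that a template system would replay per node.
HADOOP_TEMPLATES = {
    "hadoop-master": {
        "base_image": "ubuntu-22.04",
        "packages": ["openjdk-11-jdk", "hadoop"],
        "services": ["namenode", "resourcemanager"],
        "post_install": ["hdfs namenode -format"],
    },
    "hadoop-slave": {
        "base_image": "ubuntu-22.04",
        "packages": ["openjdk-11-jdk", "hadoop"],
        "services": ["datanode", "nodemanager"],
        "post_install": [],
    },
}

def render_template(role: str) -> list[str]:
    """Expand a template into an ordered list of provisioning steps."""
    t = HADOOP_TEMPLATES[role]
    steps = [f"boot {t['base_image']}"]
    steps += [f"install {p}" for p in t["packages"]]
    steps += [f"enable {s}" for s in t["services"]]
    steps += t["post_install"]
    return steps
```

The point is that the same blueprint is replayed identically on every node of a given role, which is exactly what removes the trial-and-error from manual builds.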

2) Platform-as-a-Service (PaaS) offerings, such as Google App Engine and Microsoft Azure, typically built on top of the provider's own infrastructure (IaaS) cloud services. These offer a pre-defined set of software stacks that are tested and supported within the boundaries of the service itself. In many cases these are sets of well-known components, isolated to a single language and runtime environment, and tied to the service provider's own cloud. A "Hadoop stack" might be introduced as pre-packaged BDaaS (Big Data as a Service).

3) Data model-driven cloud application platforms, a newer approach, are agnostic to the underlying cloud layer. These platforms sit on top of cloud orchestration platforms such as OpenStack or CloudStack, capture and assemble the multiple components involved in distributed applications such as Hadoop, and manage them holistically as a single management container in the cloud target. They use an underlying application repository and workflow engine to automate common deployment steps such as infrastructure provisioning (through the underlying cloud API), installing software packages on each instance, and setting configuration parameters and dependencies between elements. Most importantly, this approach makes it possible to automate the ongoing management of Big Data applications over time: changing application settings, scaling up or scaling out, and even moving workloads between public and private clouds, with all relevant contingencies captured.
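The model-driven approach can be sketched as follows: the components of the distributed application and their dependencies are captured in one model, and the workflow engine derives a valid provisioning order from it. The component names below are illustrative, not any particular vendor's schema:

```python
# Sketch of a data-model-driven deployment: the whole workload is one
# model, and deployment order falls out of the declared dependencies.
from graphlib import TopologicalSorter

WORKLOAD_MODEL = {
    # component: the components it depends on
    "namenode": [],
    "resourcemanager": [],
    "datanode": ["namenode"],
    "nodemanager": ["resourcemanager"],
    "hive-metastore": ["namenode", "datanode"],
}

def deployment_order(model: dict) -> list[str]:
    """Return an order in which components can safely be provisioned."""
    return list(TopologicalSorter(model).static_order())
```

Because the dependencies live in the model rather than in ad hoc scripts, the same model can drive deployment, reconfiguration, and portability across cloud targets.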

Cloud Solutions for Hadoop

Each option fits different needs, and, as with most things, there are smart ways and harder ways to approach the task of integrating Hadoop-scale applications into the cloud. Here is a quick summation of what IT managers should do, and should avoid, as they approach a Hadoop-in-the-cloud implementation.

Five Key Guidelines

1. Optimized Infrastructure. Evaluate the available Infrastructure-as-a-Service (IaaS) cloud offerings and look for optimizations that streamline performance and efficiency for applications such as Hadoop. For example, the High-Memory and High-Compute instance types offered in Amazon EC2, and the SSD storage optimizations offered by providers such as CloudSigma, have specific value in this area.

2. Dynamic applications. Given the rapidly changing nature of large distributed applications, and the business rules that influence them, think about managing Hadoop applications holistically as a single entity rather than as a piecemeal collection of servers. Hadoop is a prime example of a highly distributed, complex application in need of more holistic management capabilities. Holistic management enables single-click actions on an entire workload of servers (for example, changing a common parameter setting across all Hadoop slave nodes) rather than manually editing a configuration file on hundreds of individual instances.
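The single-action idea above can be sketched in a few lines: one call fans a configuration change out to every node, instead of editing files host by host. The node list and the apply function below are stand-ins for a real cloud or cluster-management API:

```python
# Sketch of "holistic" management: apply one configuration change
# uniformly across all nodes of a workload in a single action.
def set_cluster_param(nodes, key, value, apply_fn):
    """Push key=value to every node via apply_fn; return per-node results."""
    results = {}
    for node in nodes:
        results[node] = apply_fn(node, key, value)
    return results

# Usage with a dummy apply function that just records each change:
changes = []
def dummy_apply(node, key, value):
    changes.append((node, key, value))
    return "ok"

slaves = [f"slave-{i}" for i in range(3)]
out = set_cluster_param(slaves, "dfs.replication", "3", dummy_apply)
```

In a real platform the apply function would be the provider's configuration API, and the per-node results would feed error handling and rollback, which is precisely the bookkeeping that is unmanageable by hand at hundreds of nodes.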

3. Application Metrics. Understand Hadoop itself, its components, its performance bottlenecks, and which metrics matter to the business. How will performance be monitored and measured, and what knobs can be turned to resolve problems when they arise?

4. Scaling. Define business expectations for the required scale of the application, and be ready to start small and grow incrementally. Which management tasks can be automated, to save what is surely limited IT time, and which must remain under manual control? More automation is now possible with the solutions described above.
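A "start small and grow incrementally" policy can be as simple as a rule that adds a fixed increment of nodes whenever a utilization metric stays above a target, up to a business-defined ceiling. The thresholds here are hypothetical, purely to illustrate the shape of such a rule:

```python
# Illustrative incremental scale-out rule: grow by a fixed step when
# utilization exceeds the target, never beyond the agreed maximum.
def next_cluster_size(current, utilization, target=0.70, step=2, max_nodes=100):
    """Return the cluster size after one evaluation of the scaling rule."""
    if utilization > target and current < max_nodes:
        return min(current + step, max_nodes)
    return current
```

Even a rule this simple is only safe to run unattended if the provisioning behind it is automated and consistent, which is why the scaling question and the automation question go together.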

5. Staff Skill Sets. Team members need experience with cloud infrastructure from a user's perspective, beyond their application-specific knowledge of Hadoop MapReduce, Hive or other analytics applications. Consultants and service providers can help to some extent, as can tools that streamline automated provisioning, change management and application monitoring.

With the plethora of scalable, on-demand cloud services on the market today, IT managers now have the opportunity to capture the far lower cost and greater agility that come from moving very large Hadoop workloads to the cloud. The good news is that they have a wide range of approaches to expedite the move, and the right one awaits their thoughtful choice.
About the author: 
John Yung
John Yung is a former cloud business unit manager at both Savvis and Equinix who ran the cloud-deployed applications of several major consumer packaged goods companies. He is the founder and CEO of Appcara, a provider of solutions for running complex, multi-tier and distributed applications in public and private cloud environments.

