
Making Hadoop applications work in the Cloud: Five key guidelines

Enterprises want the cost-efficiency and universal access of cloud-based applications, including Big Data-scale applications that were once presumed feasible only on on-premises resources. That includes the poster child of Big Data applications: Hadoop.


At the same time, those considering such a move wonder how it will all come together so that their trusted applications run as reliably in the cloud as they do on-premises. In many instances, Hadoop workloads require hundreds to thousands of servers, making them a prime candidate for the cost efficiency the cloud can bring. However, any application at that scale becomes harder to stand up, manage and maintain. Keeping every node in the cluster on the same configuration and dependencies, and managing changes to the entire set of servers in a simple and consistent manner, is a huge headache for IT management and for their users.


I have had the privilege of running application and infrastructure operations for several major consumer packaged goods companies on behalf of two major service providers, of developing a large-scale public cloud service, and am now assisting other firms in launching Big Data applications, including Hadoop, into the cloud. With this background, I can attest that standing up massive applications in the cloud, and running them effectively, requires both the right architectural choices and a keen eye for lower-level application dependencies. It also requires a plan for management automation: no staff has unlimited time, and getting bogged down in low-level tasks keeps IT from focusing on its other essential functions.

Cloud Solutions for Big Data


Given the maturity that has taken place on the infrastructure side of the cloud market, we are now looking for similar levels of automation and refinement on the application side. While there are others, I see three camps of solutions that have emerged for bringing applications into the cloud, each of which can be applied to Big Data:

1) Server templates (or machine images), the earliest solutions, provide pre-defined and tested blueprints of common applications, including the operating system, software packages, and the scripts required to deploy and configure apps in the cloud. They offer a quick path to deploying simple apps such as web servers, basic databases and developer stacks. Additional templates can be offered for Hadoop elements, such as the master and slave nodes of a Hadoop cluster, to ease deployment and reduce the typical trial and error (a minimal launch sketch follows this list).

2) Platform-as-a-Service (PaaS) offerings such as Google App Engine and Microsoft Azure, typically built on top of Infrastructure-as-a-Service (IaaS) clouds. These offer a pre-defined set of software stacks that are tested and supported within the boundaries of the service itself. In many cases these are sets of well-known components, isolated to a single language and runtime environment, and tied to the service provider's own cloud. A "Hadoop stack" might be introduced here as pre-packaged BDaaS (Big Data as a Service).

3) Data model-driven cloud application platforms, a newer approach, are agnostic to the underlying cloud layer. These platforms sit on top of cloud orchestration platforms such as OpenStack or CloudStack, capture and assemble the multiple components involved in distributed applications such as Hadoop, and manage them holistically as a single management container in the cloud target. They use an underlying application repository and workflow engine to automate many of the common deployment steps: provisioning infrastructure through the underlying cloud API, installing software packages on each instance, and setting configuration parameters and dependencies between elements. Most importantly, this approach makes it possible to automate the ongoing management of Big Data applications over time, for example when changing application settings, scaling servers up or scaling out, or even moving a workload between public and private clouds; all the relevant contingencies are captured.
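To make option 1 concrete, here is a minimal sketch of launching a small Hadoop cluster from a pre-built machine image on EC2 with boto3. The image ID, instance type, slave count and the configure-node.sh script passed as user data are placeholders rather than a reference implementation; a real template would also carry networking, security-group and storage settings.

```python
# Sketch: launch Hadoop master and slave nodes from a pre-built machine image.
# The AMI ID, instance type and configure-node.sh script are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

HADOOP_AMI = "ami-0123456789abcdef0"   # image pre-baked with Hadoop installed
SLAVE_COUNT = 10

def launch(role, count):
    # User data runs at first boot and tells the node which role to configure.
    user_data = f"#!/bin/bash\n/opt/hadoop/bin/configure-node.sh --role {role}\n"
    resp = ec2.run_instances(
        ImageId=HADOOP_AMI,
        InstanceType="m5.2xlarge",
        MinCount=count,
        MaxCount=count,
        UserData=user_data,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "hadoop-role", "Value": role}],
        }],
    )
    return [i["InstanceId"] for i in resp["Instances"]]

master_ids = launch("master", 1)
slave_ids = launch("slave", SLAVE_COUNT)
print("master:", master_ids, "slaves:", slave_ids)
```

Note that even with a good template, everything after launch, from configuration changes to scaling to dependency updates, is still left to the operator; that gap is what the model-driven platforms in option 3 aim to close.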
Cloud Solutions for Hadoop


Each option fits different needs, and as with most things, there are smart ways and harder ways to approach the task of integrating Hadoop-scale applications into the cloud. Here is a quick summary of what IT managers should do, and should avoid doing, as they approach a Hadoop-in-the-cloud implementation.

Five Key Guidelines


1. Optimized Infrastructure. Evaluate the available Infrastructure-as-a-Service (IaaS) offerings and look at the optimizations they provide to streamline performance and efficiency for apps such as Hadoop. For example, the high-memory and high-compute instance types Amazon Web Services offers in EC2, and the SSD storage optimizations offered by providers such as CloudSigma, have specific value in this area; a quick comparison sketch follows.
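If the target is EC2, one quick way to compare candidate instance types is to pull their published specs programmatically. The sketch below uses boto3; the shortlisted instance families are only examples of memory-, compute- and local-SSD-optimized options and will differ by provider and hardware generation.

```python
# Sketch: compare a shortlist of EC2 instance types for a Hadoop cluster.
# The shortlist is illustrative; substitute whatever your provider offers.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

CANDIDATES = ["r5.4xlarge", "c5.4xlarge", "i3.4xlarge"]  # memory / compute / local SSD

resp = ec2.describe_instance_types(InstanceTypes=CANDIDATES)
for spec in resp["InstanceTypes"]:
    print(
        spec["InstanceType"],
        "vCPUs:", spec["VCpuInfo"]["DefaultVCpus"],
        "memory (MiB):", spec["MemoryInfo"]["SizeInMiB"],
        "local storage:", spec["InstanceStorageSupported"],
    )
```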

2. Dynamic Applications. Given the rapidly changing nature of large distributed applications, and the business rules that influence them, think about managing Hadoop apps holistically as a single entity rather than as a piecemeal collection of servers. Hadoop is a terrific example of a highly distributed, complex application in need of more holistic management capabilities. Holistic management enables single-click actions on the entire workload of servers (changing a common parameter across the Hadoop slave nodes, for example, as sketched below), as opposed to manually editing a configuration file on hundreds of individual instances.
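To see what that buys you, here is a rough sketch of the do-it-yourself alternative: pushing one updated Hadoop configuration file to every slave node over SSH. The host names, file paths and restart command are assumptions (the service name reflects a Hadoop 1.x-style TaskTracker); a workload-level platform would replace this script with a single model-driven action.

```python
# Sketch: push an updated Hadoop config file to all slave nodes in parallel.
# Host names, file paths and the restart command are illustrative assumptions.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SLAVES = [f"slave{i:03d}.cluster.internal" for i in range(1, 201)]
LOCAL_CONF = "mapred-site.xml"                    # the locally edited config
REMOTE_CONF = "/etc/hadoop/conf/mapred-site.xml"

def push(host):
    # Copy the new config, then restart the daemon so the change takes effect.
    subprocess.run(["scp", LOCAL_CONF, f"{host}:{REMOTE_CONF}"], check=True)
    subprocess.run(["ssh", host, "sudo service hadoop-tasktracker restart"], check=True)
    return host

with ThreadPoolExecutor(max_workers=20) as pool:
    for host in pool.map(push, SLAVES):
        print("updated", host)
```

Even parallelized, this still leaves failure handling, ordering and rollback to the operator, which is exactly the kind of contingency a holistic, workload-level action is meant to capture.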


3. Application Metrics. Understand Hadoop itself: its components, its performance bottlenecks, and which metrics matter to the business. How will performance be monitored and measured, and what knobs can be turned to resolve problems when they arise? One lightweight starting point is sketched below.
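For instance, Hadoop daemons expose their internal counters as JSON over HTTP. The sketch below polls a NameNode's /jmx servlet; the host, port, bean name and attribute names are assumptions that vary by Hadoop version and configuration, so treat them as examples rather than a fixed contract.

```python
# Sketch: read basic HDFS health metrics from the NameNode's JMX JSON servlet.
# Host, port, bean and attribute names vary by Hadoop version; treat as examples.
import json
from urllib.request import urlopen

NAMENODE = "http://namenode.cluster.internal:50070"
QUERY = "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"

with urlopen(NAMENODE + QUERY) as resp:
    beans = json.load(resp)["beans"]

state = beans[0]
print("live datanodes :", state.get("NumLiveDataNodes"))
print("capacity used  :", state.get("CapacityUsed"))
print("capacity total :", state.get("CapacityTotal"))
```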

4. Scaling. Define business expectations for the required scale of the application, and be ready to start small and grow incrementally. Which management tasks can be automated, to save what is surely limited IT time, and which must remain under manual control? More automation is now possible with the solutions described above; a simple scale-out trigger is sketched below.
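An incremental scale-out policy can be as simple as watching the task backlog and adding slave nodes in small steps. In the sketch below, pending_tasks and launch_slaves are hypothetical stand-ins for your monitoring source and provisioning call (for example, the metrics poll and the launch() helper from the earlier sketches).

```python
# Sketch: a naive scale-out loop. pending_tasks() and launch_slaves() are
# hypothetical stand-ins for real monitoring and provisioning integrations.
import time

PENDING_THRESHOLD = 500   # queued tasks that justify adding capacity
STEP = 5                  # grow in small increments
MAX_SLAVES = 200          # hard ceiling agreed with the business

def autoscale(pending_tasks, launch_slaves, current_slaves):
    while True:
        backlog = pending_tasks()
        if backlog > PENDING_THRESHOLD and current_slaves < MAX_SLAVES:
            add = min(STEP, MAX_SLAVES - current_slaves)
            launch_slaves(add)
            current_slaves += add
            print(f"backlog={backlog}: scaled out by {add} to {current_slaves} slaves")
        time.sleep(300)   # re-evaluate every five minutes
```

A production policy would also need scale-in with graceful decommissioning, cooldown periods and data-locality considerations, which is where the automation in the platforms described earlier earns its keep.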

5. Staff Skill Sets. Team members need experience with cloud infrastructure from a user’s perspective, beyond their application-specific knowledge of Hadoop MapReduce, Hive or other analytics applications. Consultants and service providers can help to some extent, as can finding tools that streamline the process of automated provisioning, change management and application monitoring.


With the plethora of scalable, on-demand cloud services available on the market today, IT managers running Hadoop applications now have the opportunity to take advantage of the far lower cost and greater agility that come with moving very large Hadoop workloads to the cloud. The good news is that they have a wide range of approaches to expedite the move, and the right one is awaiting their thoughtful choice.
About the author: 
John Yung
John Yung is a former cloud business unit manager at both Savvis and Equinix who ran cloud-deployed applications for several major consumer packaged goods companies. He is the founder and CEO of Appcara, a provider of solutions for running complex, multi-tier and distributed applications in public and private cloud environments.
 
