Enterprises want the cost-efficiency and universal access of cloud-based applications, including Big Data-scale applications that were once presumed feasible only when delivered by on-premises resources. That includes the poster child of Big Data applications: Hadoop.
At the same time, those who even consider such a move wonder how it will all come together so that their trusted applications run as reliably in the cloud as they do on-premises. In many instances, Hadoop workloads require hundreds to thousands of servers, making them prime candidates for the cost efficiency the cloud can bring. However, any application at such a large scale becomes harder to stand up, manage and maintain. Keeping every node in the cluster consistent (ensuring each one has the same configuration and dependencies) and managing changes to the entire set of servers in a simple, consistent manner is a huge headache for IT management, and for their users.
I have had the privilege of running application and infrastructure operations for several major consumer packaged goods companies on behalf of two major service providers, of developing a large-scale public cloud service, and now of assisting other firms in launching Big Data applications, including Hadoop, into the cloud. With this background, I can attest that standing up massive applications in the cloud, and running them effectively, requires both the right architectural choices and a keen eye for lower-level application dependencies. It also requires a plan for management automation: no staff has unlimited time, and getting bogged down in low-level tasks keeps IT from applying its focus across its other essential functions.
Cloud Solutions for Big Data
Given the maturity of the infrastructure side of the cloud market, we're now looking for similar levels of automation and refinement on the application side. While there are others, I see three camps of solutions that have emerged for bringing apps into the cloud that can be applied to Big Data:
1) Server templates (or machine images), the earliest solutions, provide pre-defined and tested blueprints of common applications, including the operating system, software packages, and the scripts required to deploy and configure apps in the cloud. These offer a quick path to deploying simple apps such as web servers, basic databases and developer stacks. Additional templates can cover Hadoop elements such as the Master and Slave nodes in a Hadoop cluster, easing deployment and reducing the typical trial-and-error process.
2) Platform-as-a-Service (PaaS) offerings, such as Google App Engine and Microsoft Azure, are typically built on top of infrastructure (IaaS) cloud services. These offer a pre-defined set of software stacks that are tested and supported within the boundaries of the service itself. In many cases these are sets of well-known components, isolated to run within a single language and runtime environment, and on the service provider's own cloud. A "Hadoop stack" might be introduced as pre-packaged BDaaS (Big Data as a Service).
3) Data model-driven cloud application platforms, a newer approach, are agnostic to the underlying cloud layer. These platforms sit on top of cloud orchestration platforms such as OpenStack or CloudStack, then capture and assemble the multiple components involved in distributed applications such as Hadoop, and manage them holistically as a single management container in the cloud target. They use an underlying application repository and workflow engine to automate many of the common deployment steps, such as provisioning infrastructure (via the underlying cloud API), installing software packages on each instance, and setting configuration parameters and dependencies between elements. Most importantly, this approach makes it possible to automate the ongoing management of Big Data applications over time: changing application settings, scaling servers up or out, and even moving workloads from public clouds to private ones (or vice versa), with all relevant contingencies captured.
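To make the data model-driven approach concrete, here is a minimal sketch of how a workflow engine might expand an application blueprint into ordered provisioning steps. The blueprint schema, role names and step names are all hypothetical, invented for illustration; they are not the API of any real platform (the `dfs.replication` setting is a standard Hadoop HDFS parameter, used here as an example of a shared configuration value).

```python
# Hypothetical blueprint for a small Hadoop workload. In a real
# data model-driven platform this would live in an application
# repository; all field names here are illustrative.
HADOOP_BLUEPRINT = {
    "name": "hadoop-cluster",
    "components": [
        {"role": "master", "count": 1, "package": "hadoop-namenode"},
        {"role": "slave", "count": 3, "package": "hadoop-datanode"},
    ],
    # Shared settings applied identically to every node in the workload.
    "config": {"dfs.replication": "3"},
}


def plan_deployment(blueprint):
    """Expand a blueprint into an ordered list of provisioning steps,
    the way a workflow engine might before calling the cloud API."""
    steps = []
    for component in blueprint["components"]:
        for i in range(component["count"]):
            node = f"{component['role']}-{i}"
            steps.append(("provision", node))
            steps.append(("install", node, component["package"]))
            for key, value in blueprint["config"].items():
                steps.append(("configure", node, key, value))
    return steps


steps = plan_deployment(HADOOP_BLUEPRINT)
```

The point of the model is that the workload is described once, as a single entity, and every per-node step (provision, install, configure) is derived from it rather than scripted by hand.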
[Figure: Cloud Solutions for Hadoop]
Each option fits different needs, and as with most things, there are smart ways and harder ways to approach the task of integrating Hadoop-scale applications into the cloud. Here's a quick summary of what IT managers should do, and should avoid doing, as they approach a Hadoop-in-the-cloud implementation.
Five Key Guidelines
1. Optimized Infrastructure. Evaluate the available Infrastructure (IaaS) cloud services and look at the available optimizations to streamline performance and efficiency for apps such as Hadoop. For example, Amazon Web Services High-Memory and High-Compute instances offered in EC2, and the SSD storage optimizations offered by service providers such as CloudSigma have specific value in this area.
2. Dynamic applications. Given the rapidly changing nature of large distributed applications, and the business rules that influence them, think about managing Hadoop apps holistically as a single entity rather than as a piecemeal collection of servers. Hadoop is a terrific example of a highly distributed, complex application in need of more holistic management capabilities. Holistic management would enable single-click actions on the entire workload of servers (to change a common parameter setting in the Hadoop slave nodes, for example), as opposed to having to manually edit a configuration file on hundreds of individual instances.
3. Application Metrics. Understand Hadoop itself, its components, its performance bottlenecks, and which metrics matter to the business. How will performance be monitored and measured, and what knobs can be turned to resolve problems when they arise?
4. Scaling. Define business expectations with regard to the required scale of the application, and be ready to start small and grow incrementally. What types of management tasks can be automated, to save on what is surely limited IT time, and what must be relegated to manual control? More automation is now possible with the solutions described above.
5. Staff Skill Sets. Team members need experience with cloud infrastructure from a user’s perspective, beyond their application-specific knowledge of Hadoop MapReduce, Hive or other analytics applications. Consultants and service providers can help to some extent, as can finding tools that streamline the process of automated provisioning, change management and application monitoring.
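The holistic change management described in guideline 2 can be sketched in a few lines: one action applies a parameter change consistently across every node in the workload, instead of hand-editing a config file per instance. This is a toy model under stated assumptions; the node names and the cluster representation are hypothetical, and `mapred.child.java.opts` is simply a familiar Hadoop setting used as the example.

```python
def build_cluster(num_slaves):
    """Model each node's configuration as a dict keyed by node name.
    A real workload would hold hundreds of these, one per instance."""
    nodes = {"master-0": {"mapred.child.java.opts": "-Xmx200m"}}
    for i in range(num_slaves):
        nodes[f"slave-{i}"] = {"mapred.child.java.opts": "-Xmx200m"}
    return nodes


def set_parameter(nodes, key, value, role=None):
    """The single-click equivalent: apply one setting consistently to
    every node, optionally filtered by role (hypothetical name prefix)."""
    for name, config in nodes.items():
        if role is None or name.startswith(role):
            config[key] = value
    return nodes


cluster = build_cluster(num_slaves=200)
# One operation touches all 200 slave nodes; the master is untouched.
set_parameter(cluster, "mapred.child.java.opts", "-Xmx512m", role="slave")
```

The design choice worth noting is that the change is expressed once against the workload as a whole; the per-node fan-out is the platform's job, which is what keeps hundreds of instances from drifting out of sync.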
With the plethora of scalable, on-demand cloud services available on the market today, IT managers with Hadoop applications now have the opportunity to take advantage of the far lower cost and greater agility enabled when moving very large Hadoop workloads to the cloud. The good news is that they have a wide range of approaches to expedite the move – and the right one is awaiting their thoughtful choice.