
Making Hadoop applications work in the Cloud: Five key guidelines

Enterprises want the cost efficiency and universal access of cloud-based applications, including Big Data-scale applications that were formerly presumed feasible only on on-premises resources. That includes the poster child of Big Data applications: Hadoop.

At the same time, those considering such a move wonder how it will all come together so that their trusted applications run as reliably in the cloud as they do on-premises. In many instances, Hadoop workloads require hundreds to thousands of servers, making them a prime candidate for the cost efficiency the cloud can bring. However, any application at such a large scale becomes harder to stand up, manage and maintain. Keeping each node in the cluster consistent (the same configuration and dependencies on every server) while also managing changes to the entire set of servers in a simple, uniform manner is a huge headache for IT management, and for their users.

I have had the privilege of running application and infrastructure operations for several major consumer packaged goods companies on behalf of two major service providers, of developing a large-scale public cloud service, and now of assisting other firms in launching Big Data applications, including Hadoop, in the cloud. With that background, I can attest that standing up massive applications in the cloud, and running them effectively, requires both the right architectural choices and a keen eye for lower-level application dependencies. It also requires planning for management automation: no staff has unlimited time, and getting bogged down in low-level tasks keeps IT from focusing on its other essential functions.

Cloud Solutions for Big Data

Given the maturity of the infrastructure side of the cloud market, we are now looking for similar levels of automation and refinement on the application side. While there are others, three camps of solutions have emerged for bringing applications into the cloud that can be applied to Big Data:

1) Server templates (or machine images), the earliest solutions, provide pre-defined and tested blueprints of common applications, including the operating system, software packages, and the scripts required for deployment and configuration in the cloud. These offer a quick route to deploying simple applications such as web servers, basic databases and developer stacks. Additional templates can cover Hadoop elements such as the master and slave nodes of a cluster, easing deployment and reducing the typical trial-and-error process.
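To make the template idea concrete, here is a minimal sketch of what a server-template blueprint for Hadoop roles might capture, expressed as plain data. All names (image, packages, services) are hypothetical placeholders, not a real provider's template format:

```python
# Illustrative server templates for two Hadoop cluster roles.
# Each blueprint captures the OS image, packages, services and
# post-install steps that a template system would replay per node.
HADOOP_TEMPLATES = {
    "hadoop-master": {
        "base_image": "ubuntu-22.04",
        "packages": ["openjdk-11-jdk", "hadoop"],
        "services": ["namenode", "resourcemanager"],
        "post_install": ["hdfs namenode -format"],
    },
    "hadoop-slave": {
        "base_image": "ubuntu-22.04",
        "packages": ["openjdk-11-jdk", "hadoop"],
        "services": ["datanode", "nodemanager"],
        "post_install": [],
    },
}

def render_template(role: str) -> list[str]:
    """Expand a template into an ordered list of provisioning steps."""
    t = HADOOP_TEMPLATES[role]
    steps = [f"boot {t['base_image']}"]
    steps += [f"install {p}" for p in t["packages"]]
    steps += [f"enable {s}" for s in t["services"]]
    steps += t["post_install"]
    return steps
```

The point is that the same blueprint is replayed identically on every node of a given role, which is exactly what removes the trial-and-error from manual builds.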

2) Platform-as-a-Service (PaaS) offerings, such as Google App Engine and Microsoft Azure, typically built on top of the provider's own infrastructure (IaaS) cloud services. These offer a pre-defined set of software stacks that are tested and supported within the boundaries of the service itself. In many cases these are sets of well-known components, isolated to a single language and runtime environment, and tied to the service provider's own cloud. A "Hadoop stack" might be introduced as pre-packaged BDaaS (Big Data as a Service).

3) Data model-driven cloud application platforms, a newer approach, are agnostic to the underlying cloud layer. These platforms sit on top of cloud orchestration platforms such as OpenStack or CloudStack, capture and assemble the multiple components involved in distributed applications such as Hadoop, and manage them holistically as a single management container in the cloud target. They use an underlying application repository and workflow engine to automate common deployment steps such as infrastructure provisioning (through the underlying cloud API), installing software packages on each instance, and setting configuration parameters and dependencies between elements. Most importantly, this approach makes it possible to automate the ongoing management of Big Data applications over time: changing application settings, scaling up or scaling out, and even moving workloads between public and private clouds, with all relevant contingencies captured.
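The model-driven approach can be sketched as follows: the components of the distributed application and their dependencies are captured in one model, and the workflow engine derives a valid provisioning order from it. The component names below are illustrative, not any particular vendor's schema:

```python
# Sketch of a data-model-driven deployment: the whole workload is one
# model, and deployment order falls out of the declared dependencies.
from graphlib import TopologicalSorter

WORKLOAD_MODEL = {
    # component: the components it depends on
    "namenode": [],
    "resourcemanager": [],
    "datanode": ["namenode"],
    "nodemanager": ["resourcemanager"],
    "hive-metastore": ["namenode", "datanode"],
}

def deployment_order(model: dict) -> list[str]:
    """Return an order in which components can safely be provisioned."""
    return list(TopologicalSorter(model).static_order())
```

Because the dependencies live in the model rather than in ad hoc scripts, the same model can drive deployment, reconfiguration, and portability across cloud targets.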

Cloud Solutions for Hadoop

Each option fits different needs, and, as with most things, there are smart ways and harder ways to approach the task of integrating Hadoop-scale applications into the cloud. Here is a quick summation of what IT managers should do, and should avoid, as they approach a Hadoop-in-the-cloud implementation.

Five Key Guidelines

1. Optimized Infrastructure. Evaluate the available Infrastructure-as-a-Service (IaaS) cloud offerings and look for optimizations that streamline performance and efficiency for applications such as Hadoop. For example, the High-Memory and High-Compute instance types offered in Amazon EC2, and the SSD storage optimizations offered by providers such as CloudSigma, have specific value in this area.

2. Dynamic applications. Given the rapidly changing nature of large distributed applications, and the business rules that influence them, think about managing Hadoop applications holistically as a single entity rather than as a piecemeal collection of servers. Hadoop is a prime example of a highly distributed, complex application in need of more holistic management capabilities. Holistic management enables single-click actions on an entire workload of servers (for example, changing a common parameter setting across all Hadoop slave nodes) rather than manually editing a configuration file on hundreds of individual instances.
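The single-action idea above can be sketched in a few lines: one call fans a configuration change out to every node, instead of editing files host by host. The node list and the apply function below are stand-ins for a real cloud or cluster-management API:

```python
# Sketch of "holistic" management: apply one configuration change
# uniformly across all nodes of a workload in a single action.
def set_cluster_param(nodes, key, value, apply_fn):
    """Push key=value to every node via apply_fn; return per-node results."""
    results = {}
    for node in nodes:
        results[node] = apply_fn(node, key, value)
    return results

# Usage with a dummy apply function that just records each change:
changes = []
def dummy_apply(node, key, value):
    changes.append((node, key, value))
    return "ok"

slaves = [f"slave-{i}" for i in range(3)]
out = set_cluster_param(slaves, "dfs.replication", "3", dummy_apply)
```

In a real platform the apply function would be the provider's configuration API, and the per-node results would feed error handling and rollback, which is precisely the bookkeeping that is unmanageable by hand at hundreds of nodes.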

3. Application Metrics. Understand Hadoop itself, its components, its performance bottlenecks, and which metrics matter to the business. How will performance be monitored and measured, and what knobs can be turned to resolve problems when they arise?

4. Scaling. Define business expectations for the required scale of the application, and be ready to start small and grow incrementally. Which management tasks can be automated, to save what is surely limited IT time, and which must remain under manual control? More automation is now possible with the solutions described above.
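A "start small and grow incrementally" policy can be as simple as a rule that adds a fixed increment of nodes whenever a utilization metric stays above a target, up to a business-defined ceiling. The thresholds here are hypothetical, purely to illustrate the shape of such a rule:

```python
# Illustrative incremental scale-out rule: grow by a fixed step when
# utilization exceeds the target, never beyond the agreed maximum.
def next_cluster_size(current, utilization, target=0.70, step=2, max_nodes=100):
    """Return the cluster size after one evaluation of the scaling rule."""
    if utilization > target and current < max_nodes:
        return min(current + step, max_nodes)
    return current
```

Even a rule this simple is only safe to run unattended if the provisioning behind it is automated and consistent, which is why the scaling question and the automation question go together.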

5. Staff Skill Sets. Team members need experience with cloud infrastructure from a user's perspective, beyond their application-specific knowledge of Hadoop MapReduce, Hive or other analytics applications. Consultants and service providers can help to some extent, as can tools that streamline automated provisioning, change management and application monitoring.

With the plethora of scalable, on-demand cloud services on the market today, IT managers now have the opportunity to capture the far lower cost and greater agility that come from moving very large Hadoop workloads to the cloud. The good news is that they have a wide range of approaches to expedite the move, and the right one awaits their thoughtful choice.
About the author: 
John Yung
John Yung is a former cloud business unit manager at both Savvis and Equinix who ran the cloud-deployed applications of several major consumer packaged goods companies. He is the founder and CEO of Appcara, a provider of solutions for running complex, multi-tier and distributed applications in public and private cloud environments.

