Skip to main content

Decoding Hadoop ETL

Continuing the Q&A with Syncsort CEO Lonne Jaffe, we explore the ETL use cases in Hadoop ecosystem. Lonne explains some of the key distinguishing characteristics of ETL solutions and how they make a compelling use case with inexpensive implementations.

 

 

What makes a great ETL solution for Hadoop? Can you tell us the important characteristics?

Some legacy ETL products are too heavyweight to work well with Hadoop – they don’t run natively, they sit on edge nodes or they generate a lot of inefficient code that needs to be maintained. 

Our enterprise-grade Hadoop-based transformation engine sits on each node in a cluster to deliver accelerated performance and avoids generating code. We made an open source contribution that enabled our engine to run natively in the Hadoop environment, which was committed as MAPREDUCE-2454 in early 2013. We’re now delivering, to some of the largest and most sophisticated users of Hadoop in the world, a product that can handle complex data models that include disparate structured and unstructured data from varied data sources, including the mainframe. 

We’re also focusing our organic investments on making it as easy as possible to move legacy workloads and data into Hadoop. For example, we created a SQL analyzer that scans and creates maps of existing legacy SQL and assists in efficiently recreating those SQL-based legacy workloads in Hadoop. We also built a product that analyzes SMF records on the mainframe to identify the mainframe workloads that are best-suited to moving to Hadoop to save money, improve performance, and make the data accessible to next-generation analytics.

Expect continued improvements in these existing capabilities and more offerings like this from us. 

 

How critical is pricing in your segment? Do the customers bend backwards on the price point?

- In 2013, many companies had no budget for Hadoop.  Customers were just testing – no real dollars were committed to building out production Hadoop clusters.  That has changed substantially in 2014.  Organizations can now much more easily justify investment in Hadoop because they immediately realize cost reductions in legacy data warehouses, legacy ETL tools and mainframes, saving much more money than it costs to create their Hadoop environment. Groups within large enterprises that are running offload projects are aggregating power and budget since they’re freeing up so much annual spend. Offloading legacy workloads and data into Hadoop doesn’t only save money, but, it also brings a new class of analytical compute power to the data. These organizations can quickly demonstrate the competitive benefits of the advanced analytics that Hadoop makes possible, giving them insights that can help grow the business – contributing to the top line. Anything that can generate top line revenue growth and lower costs simultaneously is very valuable to enterprises today.

 

Among the various use cases of your products in Hadoop ecosystem, can you tell us about your most fascinating one?

- What’s most fascinating about our customers’ use cases is how they have changed the economics of managing data. One customer who measured the cost of managing a terabyte of data in their enterprise data warehouse at $100,000 was able to offload and manage data in Hadoop at a cost of only $1,000 per terabyte -- ETL-like workloads can represent as much of 40-60% of the capacity of legacy enterprise data warehouses systems. In another case, by offloading mainframe data and processing into Hadoop, a major bank was able to save money and make new analytical capabilities available. Customers can also offload workloads from legacy ETL products into Hadoop.
We’re focusing much of our R&D and acquisition bandwidth going forward on building unique products that make this offload-to-Hadoop process as seamless as possible.


<< Previous - Offloading with Hadoop

Comments

Popular posts from this blog

Hadoop's 10 in LinkedIn's 10

LinkedIn, the pioneering professional social network has turned 10 years old. One of the hallmarks of its journey has been its technical accomplishments and significant contribution to open source, particularly in the last few years. Hadoop occupies a central place in its technical environment powering some of the most used features of desktop and mobile app. As LinkedIn enters the second decade of its existence, here is a look at 10 major projects and products powered by Hadoop in its data ecosystem.
1)      Voldemort:Arguably, the most famous export of LinkedIn engineering, Voldemort is a distributed key-value storage system. Named after an antagonist in Harry Potter series and influenced by Amazon’s Dynamo DB, the wizardry in this database extends to its self healing features. Available in HA configuration, its layered, pluggable architecture implementations are being used for both read and read-write use cases.
2)      Azkaban:A batch job scheduling system with a friendly UI, Azkab…

Data deduplication tactics with HDFS and MapReduce

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.
Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.
From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.
Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and…

Top Big Data Influencers of 2015

2015 was an exciting year for big data and hadoop ecosystem. We saw hadoop becoming an essential part of data management strategy of almost all major enterprise organizations. There is cut throat competition among IT vendors now to help realize the vision of data hub, data lake and data warehouse with Hadoop and Spark.
As part of its annual assessment of big data and hadoop ecosystem, HadoopSphere publishes a list of top big data influencers each year. The list is derived based on a scientific methodology which involves assessing various parameters in each category of influencers. HadoopSphere Top Big Data Influencers list reflects the people, products, organizations and portals that exercised the most influence on big data and ecosystem in a particular year. The influencers have been listed in the following categories:

AnalystsSocial MediaOnline MediaProductsTechiesCoachThought LeadersClick here to read the methodology used.

Analysts:Doug HenschenIt might have been hard to miss Doug…