
Offloading legacy with Hadoop

With most Fortune 500 organizations having invested in mainframes and other legacy workload systems, the rise of Big Data platforms poses new integration challenges. Data integration and ETL vendors are finding fresh opportunities to solve business and IT problems within the Hadoop ecosystem.

To understand the context, challenges and opportunities, we asked Syncsort CEO Lonne Jaffe a few questions. Syncsort provides fast, secure, enterprise-grade software spanning from Big Data in Apache Hadoop to Big Iron on mainframes. At Syncsort, Lonne Jaffe is focused on accelerating the growth of the company's high-performance Big Data offerings, both organically and through acquisition.

From mainframes to Hadoop and other platforms, Syncsort seems to have been evolving continuously. Where do you see Syncsort heading next?

Lonne Jaffe: Syncsort is extraordinary in its ability to continuously reinvent itself. Today, we’re innovating around Apache Hadoop and other Big Data platforms. In the future, we’re going to continue to dedicate a substantial portion of our annual R&D budget to building software technology that adds significant value to the most important secular growth platforms. We’re targeting innovations in Hadoop that save customers money in addition to supporting their business objectives -- since both saving money and improving top-line revenue growth are important in today’s macroeconomic climate. Beginning with the world’s fastest sort and most recently with our powerful cloud and on-premise software for Hadoop, we’ve applied time-perfected algorithms to tackle new data management challenges again and again.

Going forward, along with this organic R&D innovation, we’re also looking to continue to grow inorganically, acquiring fast-growing companies with extraordinary talent and highly differentiated software that will advance our strategy and easily snap into our existing technology.

One of our current strategic focus areas is delivering an end-to-end approach to offloading expensive and inefficient legacy data and data workloads into next-generation platforms like Hadoop, both on-premise and in the cloud. There's tremendous interest from customers in this space, and it's a market segment that builds on our brand reputation for blazingly fast, secure and enterprise-scale data preparation and transformation technology.

You’ll see us continue to create and expand on strategic technology partnerships with leading vendors in the Big Data segment. Big Data platforms themselves are evolving very rapidly, and every week new technologies emerge from the Darwinian soup that is today’s Hadoop ecosystem. For example, we’re excited about the potential for the Apache Spark project to dramatically accelerate certain processing and analytical workloads.

We’ll also continue to make contributions to the Apache Hadoop community. For example, on the heels of our well-received contribution to MapReduce last year, we recently submitted a JIRA ticket to make improvements to Apache Sqoop, which will improve Hadoop connectivity to mainframe systems.

It seems you are bullish on Hadoop. What makes Hadoop so attractive to Syncsort?

Lonne Jaffe: Hadoop and the surrounding ecosystem is one of the most important secular growth opportunities in the technology industry today. Over the last few decades, you can count on one hand the number of platforms that have received a similar level of broad support from so many smaller, fast growing innovators as well as larger, incumbent vendors, in such a short period of time. Customers are starting to put Hadoop at the center of their enterprise data architectures, as part of what Cloudera has been calling the “Enterprise Data Hub.” There’s also a sizable amount of venture capital money flowing into Hadoop, rivaled only by the substantial R&D spend from larger technology vendors. As a result, Hadoop is getting better, faster than almost anything else in the technology industry.

One trend around Hadoop that has emerged over the last couple of years is customers offloading inefficient workloads and data from legacy systems -- saving tremendous amounts of money and bringing new analytical capabilities, as well as a rapidly improving platform, to the data. We’re investing heavily in this offload market opportunity, building new technology to make legacy offload-to-Hadoop as painless and low-risk as possible.

Syncsort has always been about making the latest powerful platforms easier to use, faster and more capable, so it was a pretty easy strategic decision to go “all in” on Hadoop, both on-premise and in the cloud with Amazon Elastic MapReduce. Many of the processing frameworks on Hadoop, such as MapReduce or Spark, are very sort-intensive -- this plays to Syncsort’s strengths, and we were able to infuse our advanced, high-performance algorithms and optimizer into both our on-premise and cloud-based Hadoop products relatively easily.


  1. What are some of the approaches that you use to ensure that the ETL that runs on Hadoop has a high confidence factor in providing accurate results?

  2. Hi, there is no magic bullet here, but let me take a short stab at answering your question. First and foremost is testing: developers need to unit test, and QA needs to run functional and system tests. For developers, it's really helpful if the ETL approach or tool they are using allows them to test "locally", meaning on their desktop/laptop, and also to deploy the ETL as MapReduce on a development/test cluster.

    QA then needs to test that the ETL is functionally working on a QA cluster and compare the results with the anticipated results, consulting with the business owners/analysts who will be "consuming" the results.

    If you're using a tool, it's helpful to pick one that is used by hundreds if not thousands of users and has been in production use for many years. This way you're assured the results are accurate because the product is "tried and true". One other note: if the product is generating code (Java, Pig, HiveQL, etc.) to execute the ETL on Hadoop, it's going to be harder to debug issues. Think about it: if you get a Java exception and you can't tie it back to the GUI where you developed the ETL, how are you going to debug it?

    Vendor promotion: Syncsort can do all of this with DMX-h, and you can try it out for free.

    --Keith Kohl
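
The local-testing workflow described in the comment above can be sketched in plain Python. This is a minimal illustration, not any vendor's product: the function names (`etl_mapper`, `etl_reducer`, `run_local`) and the sample "customer,amount" records are hypothetical. The idea is to keep the map and reduce logic as pure functions so a developer can assert against anticipated results on a desktop before deploying the same logic as a MapReduce job on a test cluster.

```python
def etl_mapper(line):
    # Hypothetical transform: parse "customer,amount" records and
    # emit (customer, amount) pairs for aggregation.
    customer, amount = line.strip().split(",")
    yield customer, float(amount)

def etl_reducer(key, values):
    # Aggregate a per-customer total from all values grouped under the key.
    yield key, sum(values)

def run_local(lines):
    """Simulate the MapReduce shuffle/sort locally for a desktop unit test."""
    grouped = {}
    for line in lines:
        for key, value in etl_mapper(line):
            grouped.setdefault(key, []).append(value)
    # Process keys in sorted order, as a real MapReduce shuffle would.
    return [out for key in sorted(grouped)
            for out in etl_reducer(key, grouped[key])]

# Compare against anticipated results before touching a cluster:
assert run_local(["acme,10", "globex,5", "acme,20"]) == [
    ("acme", 30.0), ("globex", 5.0)]
```

Because the mapper and reducer are ordinary functions, the same code can later be wired into Hadoop Streaming or a framework runner, while the unit tests continue to catch logic regressions locally.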


