
Offloading legacy with Hadoop

With most Fortune 500 organizations having invested in mainframes and other legacy workload systems, the rise of Big Data platforms poses new integration challenges. Data integration and ETL vendors are finding fresh opportunities to solve business and IT problems within the Hadoop ecosystem.

To understand the context, challenges and opportunities, we put a few questions to Syncsort CEO Lonne Jaffe. Syncsort provides fast, secure, enterprise-grade software spanning Big Data in Apache Hadoop to Big Iron on mainframes. At Syncsort, Lonne Jaffe is focusing on accelerating the growth of the company's high-performance Big Data offerings, both organically and through acquisition.

From mainframes to Hadoop and other platforms, Syncsort seems to have been evolving continuously. Where do you see Syncsort heading next?

Lonne Jaffe: Syncsort is extraordinary in its ability to continuously reinvent itself. Today, we’re innovating around Apache Hadoop and other Big Data platforms. In the future, we’re going to continue to dedicate a substantial portion of our annual R&D budget to building software technology that adds significant value to the most important secular growth platforms. We’re targeting innovations in Hadoop that save customers money in addition to supporting their business objectives -- since both saving money and improving top-line revenue growth are important in today’s macroeconomic climate. Beginning with the world’s fastest sort and most recently with our powerful cloud and on-premise software for Hadoop, we’ve applied time-perfected algorithms to tackle new data management challenges again and again.

Going forward, along with this organic R&D innovation, we’re also looking to continue to grow inorganically, acquiring fast-growing companies with extraordinary talent and highly differentiated software that will advance our strategy and easily snap into our existing technology.

One of our current strategic focus areas is delivering an end-to-end approach to offloading expensive and inefficient legacy data and data workloads into next generation platforms like Hadoop, both on-premise and in the cloud.  There’s tremendous interest from customers in this space and it’s a market segment that builds on our brand reputation for blazingly fast, secure and enterprise-scale data preparation and transformation technology.

You’ll see us continue to create and expand on strategic technology partnerships with leading vendors in the Big Data segment. Big Data platforms themselves are evolving very rapidly, and every week new technologies emerge from the Darwinian soup that is today’s Hadoop ecosystem. For example, we’re excited about the potential for the Apache Spark project to dramatically accelerate certain processing and analytical workloads.

We’ll also continue to make contributions to the Apache Hadoop community. For example, on the heels of our well-received contribution to MapReduce last year, we recently submitted a JIRA ticket to make improvements to Apache Sqoop, which will improve Hadoop connectivity to mainframe systems.

It seems you are bullish on Hadoop. What makes Hadoop so attractive to Syncsort?

Lonne Jaffe: Hadoop and the surrounding ecosystem is one of the most important secular growth opportunities in the technology industry today. Over the last few decades, you can count on one hand the number of platforms that have received a similar level of broad support from so many smaller, fast growing innovators as well as larger, incumbent vendors, in such a short period of time. Customers are starting to put Hadoop at the center of their enterprise data architectures, as part of what Cloudera has been calling the “Enterprise Data Hub.” There’s also a sizable amount of venture capital money flowing into Hadoop, rivaled only by the substantial R&D spend from larger technology vendors. As a result, Hadoop is getting better, faster than almost anything else in the technology industry.

One trend around Hadoop that has emerged over the last couple of years is customers offloading inefficient workloads and data from legacy systems -- saving tremendous amounts of money and bringing new analytical capabilities, as well as a rapidly improving platform, to the data. We’re investing heavily in this offload market opportunity, building new technology to make legacy offload-to-Hadoop as painless and low-risk as possible.

Syncsort has always been about making the latest powerful platforms easier to use, faster and more capable, so it was a pretty easy strategic decision to go “all in” on Hadoop, both on-premise and in the cloud with Amazon Elastic MapReduce. Many of the processing frameworks on Hadoop, such as MapReduce or Spark, are very sort-intensive -- this plays to Syncsort’s strengths, and we were able to infuse our advanced, high-performance algorithms and optimizer into both our on-premise and cloud-based Hadoop products relatively easily.


  1. What are some of the approaches that you use to ensure that the ETL that runs on Hadoop has a high confidence factor in providing accurate results?

  2. Hi, there is no magic bullet here, but let me take a short stab at answering your question. First and foremost is testing. Developers need to unit test; QA needs to do functional and system testing. For developers, it's really helpful if the ETL approach or tool they are using allows them to test "locally", meaning on their desktop/laptop, and also to deploy the ETL as MapReduce on a development/test cluster.

    QA then needs to test that the ETL is functionally working on a QA cluster and compare the results with the anticipated results, in consultation with the business owners/analysts who will be "consuming" the results.

    If you're using a tool, it's helpful to use one that is used by hundreds if not thousands of users and has been in production use for many years. This way you have more assurance that the results are accurate, because the product is "tried and true". One other note: if the product is generating code (generating Java, Pig, HiveQL, etc.) to execute the ETL on Hadoop, it's going to be harder to debug issues. Think about this: if you get a Java exception and you can't tie it back to the GUI where you developed the ETL, how are you going to debug it?

    Vendor promotion: Syncsort can do all of this with DMX-h, and you can try it out for free.

    --Keith Kohl
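
The "test locally, then compare with the anticipated results" advice in the comment above can be illustrated with a minimal sketch in Python. This is not DMX-h or any Syncsort API: the `transform` step, the `diff_results` helper and the sample records are all hypothetical, chosen only to show the idea. The comparison ignores row order, on the assumption that a job's output order on a cluster may differ from a local run.

```python
from collections import Counter


def transform(records):
    """Hypothetical ETL step: total amount per customer, skipping malformed rows."""
    totals = {}
    for line in records:
        parts = line.split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        customer, amount = parts[0].strip(), parts[1].strip()
        try:
            value = int(amount)
        except ValueError:
            continue  # amount is not a number
        totals[customer] = totals.get(customer, 0) + value
    return [f"{customer},{total}" for customer, total in totals.items()]


def diff_results(actual, expected):
    """Order-insensitive comparison that still counts duplicate rows.

    Returns (missing, unexpected): rows expected but absent from the output,
    and rows present in the output but not expected.
    """
    actual_counts, expected_counts = Counter(actual), Counter(expected)
    return expected_counts - actual_counts, actual_counts - expected_counts


# Small, hand-checked sample agreed with the people who consume the results.
sample_input = ["alice,10", "bob,5", "alice,7", "garbage", "carol,abc"]
expected = ["bob,5", "alice,17"]

missing, unexpected = diff_results(transform(sample_input), expected)
if missing or unexpected:
    print("MISMATCH missing:", dict(missing), "unexpected:", dict(unexpected))
else:
    print("ETL output matches the anticipated results")
```

The same `diff_results` check could in principle be pointed at output pulled from a development or QA cluster, so that the local run and the cluster run are judged against one agreed set of anticipated results.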


