Skip to main content

Ciao latency, hallo speed

While Hadoop had long been suspect to higher latency engines due to inherent MapReduce design, the same does not hold true any longer. With advent of faster processing engines for Hadoop, distributed data can now be processed with lower latency and in more efficient manner. We continue our discussion with Stephan Ewen to find out more about Apache Flink for distributed data processing. In the first part of the discussion, we focused on technical aspects of Apache Flink's working. Now we turn our attention to the comparative use and fitment in the overall Hadoop ecosystem. Read below to find out what Ewen has to say. 

How does Apache Flink technically compare to Spark and are there any performance benefits?

Flink and Spark start from different points in the system design space. In the end, it is really a question of finding the right tool for a particular workload. Flink’s runtime has some unique features that are beneficial in certain workloads.

Flink uses data streaming rather than batch processing as much as possible to execute both batch and streaming programs. This means for streaming programs that they are executed in a real streaming fashion, with more flexible windows, lower latency, and long living operators. For batch programs, intermediate data sets are often piped to their consumers as they are created, saving on memory and disk I/O for data sets larger than memory.
Flink is memory-aware, operating on binary data rather than Java objects. This makes heavy data crunching inside the JVM efficient, and alleviates many of the problems that the JVM has for data-intensive workloads.
Flink optimizes programs in a pre-flight stage using a cost-based optimizer rather than eagerly sending programs to the cluster. This may have advantages in performance, and helps the debuggability of the programs.
Flink has dedicated iteration operators in the APIs and in the runtime. These operators support fast iterations and allow the system to be very efficient, for example, in case of graphs.

What lies ahead on the roadmap for Apache Flink in 2015?

The Flink community has recently discussed in the developer mailing list and published a roadmap for 2015. The roadmap includes more libraries and applications on top of Flink (e.g., a graph and a Machine Learning library), support for interactive programs, improvements to streaming functionality, performance, and robustness, as well as integration with other Apache and open source projects. (See here for more details on the roadmap)

In which use cases do you see Apache Flink being a good fit vis-à-vis other ecosystem options?

Flink’s batch programs shine when using data-intensive and compute intensive pipelines - and even more so when including iterative parts. This includes both complex ETL jobs, as well as data intensive machine learning algorithms. Flink’s architecture is designed to combine robustness with the ease of use and performance benefits of modern APIs and in-memory processing. A good example can be recommendation systems for objects like new movies on Netflix, or shopping articles on Amazon.

For data streaming use cases, the newly streaming API (beta status) offers beautiful high-level APIs with flexible windowing semantics, backed by a low-latency execution engine.

Graph algorithms work particularly well on Flink, due to its strong support for (stateful) iterative algorithms. As one of the first major libraries, Flink’s graph library “Gelly” has been added in its first version.

Stephan Ewen is committer and Vice President of Apache Flink and co-founder and CTO of data Artisans, a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Stephan was leading the development of Flink since the early days of the project (then called Stratosphere). Stephan holds a PhD in Computer Science from the University of Technology, Berlin, and has been with IBM Research and Microsoft Research in the course of several internships.


Popular posts from this blog

Hadoop's 10 in LinkedIn's 10

LinkedIn, the pioneering professional social network has turned 10 years old. One of the hallmarks of its journey has been its technical accomplishments and significant contribution to open source, particularly in the last few years. Hadoop occupies a central place in its technical environment powering some of the most used features of desktop and mobile app. As LinkedIn enters the second decade of its existence, here is a look at 10 major projects and products powered by Hadoop in its data ecosystem.
1)      Voldemort:Arguably, the most famous export of LinkedIn engineering, Voldemort is a distributed key-value storage system. Named after an antagonist in Harry Potter series and influenced by Amazon’s Dynamo DB, the wizardry in this database extends to its self healing features. Available in HA configuration, its layered, pluggable architecture implementations are being used for both read and read-write use cases.
2)      Azkaban:A batch job scheduling system with a friendly UI, Azkab…

Data deduplication tactics with HDFS and MapReduce

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.
Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.
From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.
Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and…

Top Big Data Influencers of 2015

2015 was an exciting year for big data and hadoop ecosystem. We saw hadoop becoming an essential part of data management strategy of almost all major enterprise organizations. There is cut throat competition among IT vendors now to help realize the vision of data hub, data lake and data warehouse with Hadoop and Spark.
As part of its annual assessment of big data and hadoop ecosystem, HadoopSphere publishes a list of top big data influencers each year. The list is derived based on a scientific methodology which involves assessing various parameters in each category of influencers. HadoopSphere Top Big Data Influencers list reflects the people, products, organizations and portals that exercised the most influence on big data and ecosystem in a particular year. The influencers have been listed in the following categories:

AnalystsSocial MediaOnline MediaProductsTechiesCoachThought LeadersClick here to read the methodology used.

Analysts:Doug HenschenIt might have been hard to miss Doug…