
Ciao latency, hallo speed

While Hadoop has long been associated with high-latency processing due to the inherent design of MapReduce, that no longer holds true. With the advent of faster processing engines for Hadoop, distributed data can now be processed with lower latency and in a more efficient manner. We continue our discussion with Stephan Ewen to find out more about Apache Flink for distributed data processing. In the first part of the discussion, we focused on the technical aspects of how Apache Flink works. Now we turn our attention to its comparative use and fit within the overall Hadoop ecosystem. Read on to find out what Ewen has to say.

How does Apache Flink technically compare to Spark, and are there any performance benefits?


Flink and Spark start from different points in the system design space. In the end, it is really a question of finding the right tool for a particular workload. Flink’s runtime has some unique features that are beneficial in certain workloads.

Flink uses data streaming rather than batch processing as much as possible to execute both batch and streaming programs. For streaming programs, this means they are executed in a true streaming fashion, with more flexible windows, lower latency, and long-lived operators (a minimal sketch follows below). For batch programs, intermediate data sets are often piped to their consumers as they are created, saving memory and disk I/O for data sets larger than memory.
Flink is memory-aware, operating on binary data rather than Java objects. This makes heavy data crunching inside the JVM efficient, and alleviates many of the problems that the JVM has for data-intensive workloads.
Flink optimizes programs in a pre-flight stage using a cost-based optimizer rather than eagerly sending programs to the cluster. This can improve performance and makes programs easier to debug.
Flink has dedicated iteration operators in its APIs and in the runtime. These operators support fast iterations and allow the system to be very efficient, for example in the case of graph processing.
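
To make the streaming execution model concrete, here is a minimal sketch of a windowed word count on Flink's DataStream API. It is illustrative only: the socket source, host, and port are placeholders, and method names such as keyBy and timeWindow follow the shape of later Flink releases rather than the beta streaming API discussed in this interview.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source: a text stream read from a local socket.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
            // Split each line into (word, 1) pairs.
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            // Group by word and count within 5-second windows; records are
            // processed continuously rather than in staged batches.
            .keyBy(0)
            .timeWindow(Time.seconds(5))
            .sum(1);

        counts.print();
        env.execute("Windowed word count (sketch)");
    }
}
```

The point is that records flow through the pipeline continuously, and windows are evaluated as data arrives rather than at a batch boundary.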

What lies ahead on the roadmap for Apache Flink in 2015?


The Flink community recently discussed a roadmap for 2015 on the developer mailing list and has since published it. The roadmap includes more libraries and applications on top of Flink (e.g., a graph library and a machine learning library), support for interactive programs, improvements to streaming functionality, performance, and robustness, as well as integration with other Apache and open source projects. (See the published roadmap for more details.)


In which use cases do you see Apache Flink being a good fit vis-à-vis other ecosystem options?


Flink’s batch programs shine in data-intensive and compute-intensive pipelines, and even more so when those pipelines include iterative parts. This covers complex ETL jobs as well as data-intensive machine learning algorithms. Flink’s architecture is designed to combine robustness with the ease of use and performance benefits of modern APIs and in-memory processing. A good example is recommendation systems, for items such as new movies on Netflix or shopping articles on Amazon.
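
As a rough illustration of such a batch pipeline (not taken from the interview), the sketch below joins a hypothetical ratings data set with an item catalog and aggregates ratings per item, the kind of ETL step that typically precedes training a recommender. The file paths, schemas, and field positions are assumptions made for the example.

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

public class RatingsEtlSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical inputs: (userId, itemId, rating) and (itemId, title).
        DataSet<Tuple3<Long, Long, Double>> ratings = env
                .readCsvFile("hdfs:///data/ratings.csv")
                .types(Long.class, Long.class, Double.class);
        DataSet<Tuple2<Long, String>> items = env
                .readCsvFile("hdfs:///data/items.csv")
                .types(Long.class, String.class);

        // Sum ratings per item; the grouping key (field 1) is preserved,
        // other non-aggregated fields are arbitrary.
        DataSet<Tuple3<Long, Long, Double>> perItem = ratings.groupBy(1).sum(2);

        // Join with the item catalog to attach human-readable titles.
        DataSet<Tuple2<String, Double>> titled = perItem
                .join(items).where(1).equalTo(0)
                .with(new JoinFunction<Tuple3<Long, Long, Double>, Tuple2<Long, String>, Tuple2<String, Double>>() {
                    @Override
                    public Tuple2<String, Double> join(Tuple3<Long, Long, Double> sum, Tuple2<Long, String> item) {
                        return new Tuple2<>(item.f1, sum.f2);
                    }
                });

        titled.writeAsCsv("hdfs:///data/item_rating_totals.csv");
        env.execute("Ratings ETL sketch");
    }
}
```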

For data streaming use cases, the new streaming API (currently in beta) offers a clean, high-level API with flexible windowing semantics, backed by a low-latency execution engine.

Graph algorithms work particularly well on Flink, due to its strong support for (stateful) iterative algorithms. Flink’s graph library, “Gelly”, has recently been added in its first version as one of the first major libraries.
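
Gelly packages graph algorithms such as connected components out of the box. To illustrate why the runtime’s iteration operators matter for this workload, here is a hand-rolled, library-free sketch that propagates minimum component labels using the DataSet API’s bulk iteration; the tiny hard-coded graph and the fixed round count are assumptions for the example, and Gelly’s actual API is not shown.

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class ConnectedComponentsSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Tiny hard-coded graph: each vertex starts with its own id as its
        // component label; edges are listed in both directions.
        DataSet<Tuple2<Long, Long>> vertices = env.fromElements(
                new Tuple2<>(1L, 1L), new Tuple2<>(2L, 2L), new Tuple2<>(3L, 3L),
                new Tuple2<>(4L, 4L), new Tuple2<>(5L, 5L));
        DataSet<Tuple2<Long, Long>> edges = env.fromElements(
                new Tuple2<>(1L, 2L), new Tuple2<>(2L, 1L),
                new Tuple2<>(2L, 3L), new Tuple2<>(3L, 2L),
                new Tuple2<>(4L, 5L), new Tuple2<>(5L, 4L));

        // Bulk iteration: repeat the propagation step a fixed number of rounds.
        IterativeDataSet<Tuple2<Long, Long>> loop = vertices.iterate(10);

        // Each vertex sends its current component label to its neighbours.
        DataSet<Tuple2<Long, Long>> candidates = loop
                .join(edges).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> join(Tuple2<Long, Long> vertex, Tuple2<Long, Long> edge) {
                        return new Tuple2<>(edge.f1, vertex.f1);
                    }
                });

        // Each vertex keeps the smallest label it has seen so far.
        DataSet<Tuple2<Long, Long>> updated = loop.union(candidates)
                .groupBy(0)
                .min(1);

        DataSet<Tuple2<Long, Long>> components = loop.closeWith(updated);

        // Triggers execution and prints (vertexId, componentId) pairs.
        System.out.println(components.collect());
    }
}
```

Because the loop is defined inside a single program via iterate/closeWith, the system does not have to schedule a separate job for every round.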



Stephan Ewen is a committer and the Vice President of Apache Flink, and co-founder and CTO of data Artisans, a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Stephan led the development of Flink from the early days of the project (then called Stratosphere). Stephan holds a PhD in Computer Science from the Technical University of Berlin and has spent time at IBM Research and Microsoft Research over the course of several internships.

