Hadoop High 5 with IBM's Anjul Bhambhri

In our thought leadership series 'Hadoop High 5', we ask leaders in the big data and Hadoop ecosystem about their vision and the path ahead. Continuing the series, we asked Anjul Bhambhri five questions. Anjul Bhambhri is Vice President of Big Data at IBM. She was previously the Director of IBM Optim application and data lifecycle management tools. She is a seasoned professional with over twenty-five years in the database industry. Over this time, Anjul has held various engineering and management positions at IBM, Informix and Sybase. Prior to her assignment in tools, Anjul spearheaded the development of XML capabilities in IBM's DB2 database server. She is a recipient of the YWCA of Silicon Valley's "Tribute to Women in Technology" award for 2009. Anjul holds a degree in Electrical Engineering.

1. Tell us about your journey with big data and Hadoop so far.


IBM has invested heavily in Hadoop to date, and we'll continue to do so. We see it as a key technology that solves a variety of problems for our customers. InfoSphere BigInsights is IBM's distribution of Hadoop. It incorporates 100% standard open-source components, but we've included enterprise-grade features, which is very important to our client base. We've done this in a way that preserves the openness of the platform but also provides significant advantages, helping customers deploy solutions faster and more cost-efficiently. Each of our enterprise-grade features is an opt-in for customers. They can remain purely on the open-source capabilities if they wish, and we provide support per IBM's standard support models. I would like to point out that this support model offers longer support lifetimes than those available from pure-play open-source vendors.

We have also been active contributors to specific projects that are of value to our enterprise customer base, such as HBase, metadata management, security and encryption, in addition to a number of bug fixes across various projects in the ecosystem. We have also brought the power of Hadoop to the Power and Z mainframe platforms.

A good example of our commitment is Big SQL, IBM's ANSI-compliant SQL on Hadoop. Big SQL leverages more than 30 years of experience in database engineering. It works natively on Hadoop data sources and interoperates seamlessly with open-source tools. This is a big win for customers, since they can use their existing investments in SQL tools, including Cognos, SPSS and other third-party tools. They gain seamless access to data, regardless of where it's stored, and they don't need to rewrite their queries into someone's less-capable dialect of SQL. This not only saves time and money; it simplifies the environment and helps customers deploy solutions faster and more reliably.
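To make that "existing SQL tools keep working" point concrete, here is a minimal Java sketch that queries a Hadoop-backed table through Big SQL over plain JDBC. The driver class, host, port, credentials and table name are illustrative assumptions, not details from the interview; Big SQL is typically reached through IBM's Data Server (DB2) JDBC driver, but check your installation's documentation for the exact connection settings.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlQuery {
    public static void main(String[] args) throws Exception {
        // Driver class, URL, port, credentials and table are placeholders for illustration.
        Class.forName("com.ibm.db2.jcc.DB2Driver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:db2://bigsql-head.example.com:32051/bigsql", "bigsql", "password");
             Statement stmt = conn.createStatement();
             // Standard SQL over data stored in Hadoop; no special dialect required.
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, SUM(amount) AS total " +
                 "FROM sales_hdfs GROUP BY region ORDER BY total DESC")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getDouble("total"));
            }
        }
    }
}
```

Because the access path is ordinary JDBC, the same query could just as well be issued from a BI or reporting tool instead of hand-written code.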

IBM believes that Hadoop has become the heterogeneous compute platform that allows us to run a variety of applications, especially analytics. While a number of initial implementations focused on basic warehouse-style processing, data lakes, and optimization of transformation workloads, they are now graduating to higher-level analytics using poly-structured data. New geospatial data is being brought in, for example, to analyze accident proclivity in specific zip codes based on road conditions. Traffic information is being integrated with vehicle wear-and-tear information to influence insurance rating policies. Such analytics require large amounts of both compute and storage, and Hadoop has made them possible.

2. A common question: does IBM Watson complement or substitute for the Hadoop ecosystem?


IBM Watson specializes in cognitive computing and Q&A-style analytics and solutions. Hadoop is an essential underpinning for such a system, especially in an enterprise context. Data in enterprises is imperfect and needs a set of curation and aggregation steps before the Watson magic can be applied. Information extraction and entity integration are two key elements of this curation process. IBM's distribution of Hadoop, BigInsights, provides comprehensive text analytics and machine learning capabilities that Watson uses in this curation process. If I'm writing an application to parse human-to-human call-center conversations, for example, or an application to process social media feeds or log files, I can build and deploy these applications much faster and get better results, because IBM has already done the heavy lifting in its text and machine learning engines. This means customers can start solving real business problems faster, and glean insights more easily, from their unstructured data.
We see the actual capture and storage of data in Hadoop as the easy part of the problem. Anybody can do this. We're focused on the analytic tooling that can help customers get value out of the information they're capturing.

3. Among the various big data use cases that you have implemented with customers, which one really made you say ‘wow’?


Organizations today are overwhelmed with Big Data. In fact, the world is generating more than 2.5 billion gigabytes of data every day, and 80 percent of it is "unstructured": everything from images, video and audio to social media and a blizzard of impulses from embedded sensors and distributed devices. Applying analytics to this data can help businesses handle the onslaught of information and make better business decisions; most importantly, it can even save lives.
UCLA, for instance, is relying on breakthrough big data technology to help patients with traumatic brain injuries. At UNC Healthcare, doctors are using a big data and Smarter Care solution to identify high-risk patients, understand in context what’s causing them to be hospitalized, and then take preventative action.
Scientists and engineers at IBM continue to push the boundaries of science and technology with the goal of creating systems that sense, learn, reason and interact with people in new ways. It’s aimed at making the relationship between human and computer more natural so that increasingly, computers will offer advice, rather than waiting for commands.

4. Looking over the horizon, where do you see the Hadoop market heading over the next three years?


The success of open source Hadoop means organizations today can rethink their approach to how they handle information. Rather than taking the time to decide what information to retain and what to discard, they can keep all the data in Hadoop in case it's needed, which is much more cost-efficient. With access to more data, management can gain a better understanding of customers, their needs and how they use products and services.
This is pretty impressive, considering few people even knew about Hadoop five years ago.

We believe that the focus for Hadoop will shift from data ingest, processing and related plumbing to interactive information discovery and collaborative analytics over the large amounts of data it encourages organizations to store. We see Spark as a fundamental enabler of a new generation of analytics applications, especially because of its unified programming and application development model, as well as its unique approach to in-memory processing.
Hadoop will continue to handle the heavy batch reporting and analytics workloads, as well as become the next-generation warehousing platform. However, we believe that the combination of Spark and Hadoop, executing on the same substrate of storage and compute, will solve the one fundamental problem of conventional warehousing: making it accessible, actionable and useful for business users by spawning an entirely new set of analytics applications.
For this to happen, there needs to be a strong tool chain that turns every business analyst into a data scientist over time. It is also imperative that we have reference architectures and frameworks that allow for standardization between applications, so that they can exchange data and collaborate with one another.

5. Given superpowers for a day, what would you like to do with them?


I have always believed in using technology to bring fundamental change to people's lives. Big data holds the promise of such change.
As a woman and a mother, I hold dear to my heart the education of children all the way through college. With my superpowers, I would create a set of applications and advanced analytics built on Hadoop and Spark to help teachers understand students' needs. The applications would also help identify drop-out indicators at universities and provide better counseling on the courses and majors that fit the strengths of each individual. They would use Watson to help students attend the schools that best fit them, with the right level of financial assistance, so that they can enter the workforce with little debt, if any.
If I can get all of that done in a day, I am sure the next day will bring a new dawn!



Ciao latency, hallo speed

While Hadoop had long been associated with high-latency engines due to its inherent MapReduce design, that no longer holds true. With the advent of faster processing engines for Hadoop, distributed data can now be processed with lower latency and in a more efficient manner. We continue our discussion with Stephan Ewen to find out more about Apache Flink for distributed data processing. In the first part of the discussion, we focused on the technical aspects of how Apache Flink works. Now we turn our attention to its comparative use and fit in the overall Hadoop ecosystem. Read below to find out what Ewen has to say.

How does Apache Flink technically compare to Spark and are there any performance benefits?


Flink and Spark start from different points in the system design space. In the end, it is really a question of finding the right tool for a particular workload. Flink’s runtime has some unique features that are beneficial in certain workloads.

Flink uses data streaming rather than batch processing as much as possible to execute both batch and streaming programs. For streaming programs, this means they are executed in a true streaming fashion, with more flexible windows, lower latency, and long-lived operators. For batch programs, intermediate data sets are often piped to their consumers as they are created, saving memory and disk I/O for data sets larger than memory.
Flink is memory-aware, operating on binary data rather than Java objects. This makes heavy data crunching inside the JVM efficient, and alleviates many of the problems that the JVM has for data-intensive workloads.
Flink optimizes programs in a pre-flight stage using a cost-based optimizer rather than eagerly sending programs to the cluster. This can improve performance and helps with the debuggability of programs.
Flink has dedicated iteration operators in the APIs and in the runtime. These operators support fast iterations and allow the system to be very efficient, for example in the case of graph processing.
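To make the native iteration support concrete, here is a minimal Java sketch using Flink's DataSet API: a bulk iteration that estimates pi by Monte Carlo sampling, closely following the style of the examples shipped with Flink. The iteration count is arbitrary, and minor API details may vary between Flink versions.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class PiEstimation {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        final int numSamples = 10000;

        // Start a bulk iteration from a single counter element.
        IterativeDataSet<Integer> initial = env.fromElements(0).iterate(numSamples);

        // Each iteration throws one random dart and increments the counter on a hit.
        DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer count) {
                double x = Math.random();
                double y = Math.random();
                return count + ((x * x + y * y < 1) ? 1 : 0);
            }
        });

        // Close the iteration: feed the result of each step back as the next input.
        DataSet<Integer> hits = initial.closeWith(iteration);

        // Convert the hit count into a pi estimate and print it (triggers execution).
        hits.map(new MapFunction<Integer, Double>() {
            @Override
            public Double map(Integer count) {
                return count / (double) numSamples * 4;
            }
        }).print();
    }
}
```

The point of the dedicated iteration operator is that the loop runs inside the dataflow on the cluster, rather than resubmitting a new job per round from the client.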

What lies ahead on the roadmap for Apache Flink in 2015?


The Flink community recently discussed and published a roadmap for 2015 on the developer mailing list. The roadmap includes more libraries and applications on top of Flink (e.g., a graph and a machine learning library), support for interactive programs, improvements to streaming functionality, performance, and robustness, as well as integration with other Apache and open source projects. (See the project roadmap for more details.)


In which use cases do you see Apache Flink being a good fit vis-à-vis other ecosystem options?


Flink's batch programs shine in data-intensive and compute-intensive pipelines, and even more so when they include iterative parts. This covers complex ETL jobs as well as data-intensive machine learning algorithms. Flink's architecture is designed to combine robustness with the ease of use and performance benefits of modern APIs and in-memory processing. A good example is recommendation systems for items like new movies on Netflix or shopping articles on Amazon.

For data streaming use cases, the new streaming API (in beta) offers beautiful high-level APIs with flexible windowing semantics, backed by a low-latency execution engine.
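As an illustration of those windowing semantics, below is a small Java sketch of a windowed word count over a socket stream. It is written against the later, stable Flink DataStream API rather than the beta API discussed here, so method names such as timeWindow may differ in the version you run; the host, port and window size are placeholders.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read lines from a local socket; `nc -lk 9999` can serve as a test source.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)                         // key the stream by word
            .timeWindow(Time.seconds(10))     // 10-second tumbling windows
            .sum(1);                          // count per word and window

        counts.print();
        env.execute("Windowed word count");
    }
}
```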

Graph algorithms work particularly well on Flink, due to its strong support for (stateful) iterative algorithms. Flink's graph library "Gelly" has been added in a first version, as one of the first major libraries.



Stephan Ewen is a committer and Vice President of Apache Flink, and co-founder and CTO of data Artisans, a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Stephan had led the development of Flink since the early days of the project (then called Stratosphere). Stephan holds a PhD in Computer Science from the University of Technology, Berlin, and has worked with IBM Research and Microsoft Research over the course of several internships.



Distributed data processing with Apache Flink

A new distributed engine named Apache Flink has been making its presence felt in the Hadoop ecosystem, primarily due to its faster processing and expressive coding capabilities. To get a better idea of Apache Flink's design and integration, HadoopSphere caught up with PMC chair Stephan Ewen and asked him more about the product. This is the first part of the interaction with Ewen, in which we ask him about the technical aspects of Apache Flink.

How does Apache Flink work?


Flink is a distributed engine for batch and stream data processing. Its design draws inspiration from MapReduce, MPP databases, and general dataflow systems. Flink can work completely independently of existing technologies like Hadoop, but can run on top of HDFS and YARN.

Developers write Flink applications in fluent APIs in Java or Scala, based on parallel collections (data sets / data streams). The APIs are designed to feel familiar to people who know Hadoop, Apache Spark, Google Dataflow or SQL. For the latter, Flink adds a twist of SQL-style syntax to the API, with richer data types (beyond key/value pairs) whose fields can be used like attributes in SQL. Flink's APIs also contain dedicated constructs for iterative programs, to support graph processing and machine learning efficiently.
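A minimal batch word count in Java gives a feel for these fluent, collection-style APIs; the HDFS input path below is a placeholder.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder HDFS path; a local file path works just as well.
        DataSet<String> text = env.readTextFile("hdfs:///logs/input.txt");

        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .groupBy(0)   // group by the word field of the tuple
            .sum(1);      // sum the counts per word

        counts.print();   // triggers execution and prints the result
    }
}
```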

Rather than executing the programs directly, Flink transforms them into a parallel data flow - a graph of parallel operators and data streams between those operators. The parallel data flow is executed on a Flink cluster, consisting of distributed worker processes, much like a Hadoop cluster.

Flink’s runtime is designed with both efficiency and robustness in mind, integrating various techniques in a unique blend of data pipelining, custom memory management, native iterations, and a specialized serialization framework. The runtime is able to exploit the benefits of in-memory processing, while being robust and efficient with memory and shuffles.

For batch programs, Flink additionally applies a series of optimizations while transforming the user programs to parallel data flow, using techniques from SQL databases, applied to Java/Scala programs. The optimizations are designed to reduce data shuffling or sorting, or automatically cache constant data sets in memory, for iterative programs.

What are the integration options available with Apache Flink?


Unlike MapReduce programs, Flink programs are embedded into regular Java/Scala programs, making them more flexible in interacting with other programs and data sources.

As an example, it is possible to define a program that starts with some HDFS files and joins in a SQL (JDBC) data source. Inside the same program, one can switch from the functional data set API to a graph processing paradigm (vertex-centric) to define some computations, and then switch back to the functional paradigm for the next steps.
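A simplified sketch of such a mixed-source program is shown below in Java with the DataSet API: one data set is read from HDFS and joined with a second data set standing in for the relational extract. The file paths and schemas are placeholders, and the actual JDBC ingestion (via Flink's JDBC input format) and the switch to the vertex-centric graph API are omitted for brevity.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class HdfsRelationalJoin {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Orders landed in HDFS as CSV: (customerId, amount). Path is a placeholder.
        DataSet<Tuple2<String, Double>> orders =
            env.readCsvFile("hdfs:///warehouse/orders.csv").types(String.class, Double.class);

        // Customer master data: (customerId, name). In a real job this could be
        // loaded from a relational database through the JDBC connector instead.
        DataSet<Tuple2<String, String>> customers =
            env.readCsvFile("hdfs:///warehouse/customers.csv").types(String.class, String.class);

        // Join the two data sets on the customer id (field 0 of both).
        DataSet<Tuple2<Tuple2<String, Double>, Tuple2<String, String>>> joined =
            orders.join(customers).where(0).equalTo(0);

        // Print a small sample of the joined records (triggers execution).
        joined.first(10).print();
    }
}
```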

The Flink community is also currently working on integrating SQL statements into the functional API.





Stephan Ewen is a committer and Vice President of Apache Flink, and co-founder and CTO of data Artisans, a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Stephan had led the development of Flink since the early days of the project (then called Stratosphere). Stephan holds a PhD in Computer Science from the University of Technology, Berlin, and has worked with IBM Research and Microsoft Research over the course of several internships.


Building an open source data warehouse with Apache Tajo

In our continuing series on Apache Tajo, we present the second part of our interview with PMC chair Hyunsik Choi. Having covered the technical aspects of Apache Tajo in the first part, we now explore its fit in the big data ecosystem. Tajo is designed for SQL queries on large data sets stored on HDFS and other data sources. Read below to find out more about how Apache Tajo can help you build an open source data warehouse.

How does Apache Tajo technically compare to Hive, Impala and Spark, and what is the reason for the performance difference?

From a general viewpoint, Tajo is different from Hive, Spark, and Impala.
Tajo is a monolithic system for distributed relational processing. In contrast, Hive and Spark SQL are based on general-purpose processing systems (i.e., Tez and Spark, respectively), with an additional layer for a SQL(-like) language.
Tajo and Hive are implemented in Java, Impala is implemented in C++, and Spark is implemented in Scala.

A micro-benchmark would be needed to identify which parts give the performance benefits, but as far as I know there hasn't been such a study.
I can guess at some reasons as follows:
Tajo’s query optimizer is mature.
Tajo is specialized for distributed relational processing throughout the whole stack, and many parts, including the lower-level implementation, are optimized for this purpose.
Tajo's DAG framework has more optimization opportunities through combinations of various shuffle methods and flexible relational operator trees within stages.
Tajo has more than 40 physical operators. Some are disk-based algorithms that spill data to disk, while others are main-memory algorithms. Tajo's physical planner chooses the best one according to the logical plan and available resources.
Tajo's hash shuffle is very fast because it does a per-node hash shuffle, where all tuples associated with the same key are stored in a single file. This approach can exploit more sequential access during spilling and fetching. In contrast, the disk-spilled shuffle methods of other systems do a task-level hash shuffle, causing a large number of small random accesses.

Does Apache Tajo complement or substitute Apache Hive?

In 2010, Tajo was designed as an alternative to Apache Hive. At that time, there was no alternative to Hive. We are still driving Tajo as an alternative to Hive. However, Tajo is also used as a complementary system. Some users maintain both systems at the same time while they migrate Hive workloads to Tajo. Such a migration is very easy because Tajo is compatible with most Hive features except for SQL syntax; Tajo is an ANSI SQL standard compliant system. Also, some users replace only certain Hive workloads with Apache Tajo for its lower response times, without completely migrating from Hive to Tajo.

Can you take us through key steps in establishing a complete data warehouse with Apache Tajo?

First of all, you need to make a plan for the following:
which data sources you will use (e.g., web logs, text files, JSON, HBase tables, RDBMSs, …)
how long you will archive them
most dominant workloads and table schemas

The above factors may be similar to those you would consider for RDBMS-based DW systems. But Tajo supports in-situ processing on various data sources (e.g., HDFS, HBase) and file formats (e.g., Parquet, SequenceFile, RCFile, Text, flat JSON, and custom file formats). So, with Tajo you can maintain a warehouse with an ETL process, or archive raw data sets directly without ETL. Also, depending on the workloads and the archiving period, you need to consider table partitions, file formats, and compression policy. Recently, many Tajo users have adopted Parquet to archive data sets, and Parquet also provides relatively fast query response times. Parquet's storage space efficiency is great due to the advantages of columnar compression.
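As a rough sketch of the archiving step, the following Java snippet registers an external, Parquet-backed Tajo table over raw logs already landed in HDFS, using Tajo's JDBC driver. The driver class, connection URL, port, table schema and path are assumptions for illustration, and partitioning is omitted; consult the Tajo documentation for the exact DDL syntax and connection settings of your version.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TajoCreateArchiveTable {
    public static void main(String[] args) throws Exception {
        // Driver class, host, port and paths are illustrative placeholders.
        Class.forName("org.apache.tajo.jdbc.TajoDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:tajo://tajo-master.example.com:26002/default");
             Statement stmt = conn.createStatement()) {
            // External, Parquet-backed table over raw web logs stored in HDFS.
            stmt.executeUpdate(
                "CREATE EXTERNAL TABLE web_logs (user_id TEXT, url TEXT, latency INT4) " +
                "USING PARQUET " +
                "LOCATION 'hdfs:///warehouse/web_logs'");
        }
    }
}
```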

From the archived data sets, you can build data marts that enable faster access to certain subject data sets. Due to its low latency, Tajo can be used as an OLAP engine to produce reports and process interactive ad-hoc queries on data marts via BI tools and its JDBC driver.
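Once the marts are in place, ad-hoc queries can be issued over the same JDBC route that BI tools use. The short sketch below reuses the hypothetical connection settings and table from the previous snippet; the query itself is only an example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TajoAdHocQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.tajo.jdbc.TajoDriver");   // connection details are placeholders
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:tajo://tajo-master.example.com:26002/default");
             Statement stmt = conn.createStatement();
             // Example mart-style aggregate: slowest URLs by average latency.
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, AVG(latency) AS avg_latency FROM web_logs " +
                 "GROUP BY url ORDER BY avg_latency DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getDouble("avg_latency"));
            }
        }
    }
}
```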



Hyunsik Choi is a director of research at Gruter Inc., a big data platform startup located in Palo Alto, CA. He is a co-founder of the Apache Tajo project, an open source data warehouse system and one of the Apache top-level projects. Since 2013, he has been a full-time contributor to Tajo. His recent interests are query compilation, cost-based optimization, and large-scale distributed systems. He obtained his PhD from Korea University in 2013.


Technical deep dive into Apache Tajo

Over the past few months, a new Hadoop-based warehouse called Apache Tajo has been generating buzz. We did a technical deep dive into the topic and reached out to PMC chair Hyunsik Choi. Apache Tajo has been an Apache top-level project since March 2014 and supports SQL standards. It has a powerful distributed processing architecture that is not based on MapReduce. To get a better sense of Tajo's claims of providing a distributed, fault-tolerant, low-latency and high-throughput SQL engine, we asked Choi a few questions. This Q&A is published as a two-part article series on hadoopsphere.com. Read below to get a better idea of Apache Tajo.

How does Apache Tajo work?

As you may already know, Tajo does not use MapReduce; it has its own distributed processing framework, which is flexible and specialized for relational processing. Since its storage manager is pluggable, Tajo can access data sets stored in various storage systems, such as HDFS, Amazon S3, OpenStack Swift or the local file system. In the near future, we will add RDBMS-based storage and query optimization improvements to exploit the indexing features of underlying RDBMSs. The Tajo team has also developed its own query optimizer, which is similar to those of RDBMSs but specialized for Hadoop-scale distributed environments. The early optimizer was rule-based; since 2013, the optimizer has employed a combination of cost-based join ordering and rule-based optimization techniques.
Figure 1: Apache Tajo component architecture

Figure 1 shows the overall architecture. Basically, a Tajo cluster instance consists of one TajoMaster and a number of TajoWorkers. The TajoMaster coordinates cluster membership and resources, and it also provides a gateway for clients. TajoWorkers actually process the data sets stored in the storage systems. When a user submits a SQL query, the TajoMaster decides whether the query is executed immediately in the TajoMaster alone or across a number of TajoWorkers, and forwards the query to the workers accordingly.
Since Tajo originated in academia, many of its parts are well defined. Figure 2 shows the sequence of query planning steps. A single SQL statement passes through these steps sequentially, and each intermediate form can be optimized in several respects.
Figure 2: SQL Query transformation in Apache Tajo

As mentioned, Tajo does not use MapReduce and has its own distributed execution framework. Figure 3 shows an example of a distributed execution plan. A distributed execution plan is a directed acyclic graph (DAG). In the figure, each rounded box indicates a processing stage, and each line between rounded boxes indicates a data flow. A data flow can be specified with shuffle methods. Basically, group-by, join, and sort require shuffles. Currently, three types of shuffle methods are supported: hash, range, and scattered hash. Hash shuffle is usually used for group-by and join, and range shuffle is usually used for sort.
Also, a stage (rounded box) can be specified by an internal DAG which represents a relational operator tree. Combining a relational operator tree with various shuffle methods enables the generation of flexible distributed plans.
Figure 3: Distributed query execution plan

Can you take the example of a join operation and help us understand Apache Tajo's internals?

In Figure 3, a join operation is included in the distributed plan. As I mentioned in question #1, a join operation requires a hash shuffle. As a result, a join requires three stages: two stages for the shuffle and one stage for the join. Each shuffle stage hashes tuples by join key. The join stage fetches the two data sources already hashed in the two shuffle stages, and then executes a join physical operator (also called a join algorithm) on them.
Even though only 'join' is shown in Figure 3, Tajo has two inner join algorithms: in-memory hash join and sort-merge join. If the input data sets fit into main memory, hash join is usually chosen; otherwise, sort-merge join is chosen.

Hyunsik Choi is a director of research at Gruter Inc., a big data platform startup located in Palo Alto, CA. He is a co-founder of the Apache Tajo project, an open source data warehouse system and one of the Apache top-level projects. Since 2013, he has been a full-time contributor to Tajo. His recent interests are query compilation, cost-based optimization, and large-scale distributed systems. He obtained his PhD from Korea University in 2013.


Top 5 Big Data and Hadoop Trends in 2014

As the year 2014 bids us goodbye, let's uncover some of the key trends that dominated the big data and Hadoop arena during the year. Some key themes came to the fore, and considering that big data now dominates technology investments, these trends are indicative of the path the entire information technology world is taking.

(1)    Real time was the flavor of the year:

Much has been written about real-time big data analytics, or rather the lack of it in the traditional MapReduce world. Capable products in the form of Apache Tez, Cloudera Impala, IBM Big SQL, Apache Drill and Pivotal HAWQ were unleashed in 2013. And, as the adage goes, 2013 was history in 2014. Apache Spark took center stage and ensured that everyone talked about near-real-time at the very least. Apache Storm, along with the rest of the industry's streaming pack, got its share of the limelight. Real-time big data is here, and it is here to get better with each product release.

(2)    R&D came to the fore:

Not just Google, Microsoft, IBM, SAP and the like; many other exciting labs are coming to the party and investing heavily in big data R&D. Machine learning is passé as the real interest shifts to deep learning. Backed by years of research interest in artificial intelligence, neural networks and more, deep learning R&D has found new zest. Large industry players like AT&T Research, as well as emerging companies like Impetus Technologies, continued to invest in big data warehouse research and brought in senior executives from other companies (like IBM) to ensure they research it right and develop it hard.

(3)    The big boys kept struggling:

Big data has never been a ground dominated by the big boys of IT. Rather, the new kids on the block, Cloudera, Hortonworks and MapR, have dominated the space and continued to do so in 2014. With a billion-dollar IPO under Hortonworks' belt, things have never been so good for emerging product companies in the technology sector. These new companies are here to stay and will give many sleepless nights to the sales execs of established product companies.

(4)    It’s a man’s world:

Strange but true: whether you visit a big data conference, seminar or development shop, there have hardly been any women in the arena. Call it the invisible ceiling within the big data industry, but with the exception of Anjul Bhambhri, it is rare to see a woman dominating the scene. Even the proportion of women developers, architects and managers seems abysmally low, something we hope to see corrected in 2015 as more people take up Hadoop skills.

(5)    The shift in the services world:


Ah, the cream of the revenue pie: professional services and consulting. Not just the product world; the services world has also shown some interesting trends in the emergence of new players. Big data services, unlike traditional analytics, are still dominated by specialized players rather than CMMI-certified IT majors. By the end of 2015, we should see some new big names on the horizon. These companies do not have armies of certified professionals but have been establishing themselves by delivering successful, specialized big data solutions from small, experienced teams.

