Governance in a data lake

The need for defining a robust data governance layer is becoming an essential requirement for an enterprise data lake. Continuing our discussion on data governance, we focus on Apache Falcon as a solution option for governing the data pipelines. HadoopSphere discussed with Srikanth Sundarrajan, VP of Apache Falcon, about the product as well as the data governance requirements. In the first part of the interview, we talked about Falcon's architecture. We further discuss the functional aspects in the interaction below. 

What lies ahead on the roadmap of Apache Falcon for 2015?

Major focus areas for Apache Falcon in 2015 and beyond:
Entity management and Instance administration dashboard – Currently CLI based administration is very limiting and the real power of the dependency information available within Falcon can’t be unlocked without an appropriate visual interface. Also entity management complexities can be cut down through a friendlier UI.
Recipes – Today Falcon supports notion of a process to perform some action over data. But there are standard and routine operations that may be applicable for a wide range of users. Falcon project is currently working on enabling this through the notion of recipe. This will enable users to convert their standard data routines into templates for reuse and more importantly some common templates can be shared across users/organizations.
Life cycle – Falcon supports standard data management functions off the shelf, however the same doesn’t cater to every user’s requirement and might require customization. Falcon team is currently working on opening this up and allowing this to be customized per deployment to cater to specific needs of a user.
Operational simplification – When Falcon becomes the de-facto platform (as is the case with some of the users), the richness of dependency information contained can be leveraged to operationally simplify how data processing is managed. Today handling infrastructure outage/maintenance or degradation or application failures can stall large pipelines causing cascading issues. Dependency information in Falcon can be used to seamlessly recover from these without any manual intervention.
Pipeline designer – This is a forward-looking capability in Falcon that enables big data ETL pipelines to be authored visually. This would generate code in language such as Apache Pig and wrap them in appropriate Falcon process and define appropriate feeds.

Can you elaborate on key desired components of big data governance regardless of tool capabilities at this stage?

Security, Quality, Provenance and Privacy are fundamental when it comes to data governance
Quality – Quality of data is one of the most critical components and there has to be convenient ways to both audit the system for data quality and also build proactive mechanism to cut out any sources of inaccuracies
Provenance – Organizations typically have complex data flows and often times it is challenging to figure the lineage of this data. To be able to get this lineage at a dataset level, field level and at a record level (in that order of importance) is very important.
Security – This is fundamental and hygiene to any data system. Authentication, Authorization and Audit trail are non-negotiable. Every user has to be authenticated and all access to data is to be authorized and audited.
Privacy – Data anonymization is one of the key techniques to conform to laws and regulation of the land. This is something that the data systems have to natively support or enable.

Why would an enterprise not prefer to use commercial tools (like Informatica) and rather use open source Apache Falcon?

Apache Falcon is a Hadoop first data management system and integrates well with standard components in the big data open source eco systems that are widely adopted. This native integration with Hadoop is what makes it a tool of choice. Apache Falcon being available under liberal APL 2.0 license and housed under ASF allows users/organizations to experiment with it easily and also enable them to contribute their extensions. Recent elevation of Apache Falcon to a top-level project also assures the users about the community driven development process adopted within the Falcon project.

If someone is using Cloudera distribution, what are the options for him?

Apache Falcon is distribution agnostic and should work (with some minor tweaks) for anyone using Apache Hadoop 2.5.0 and above along with Oozie 4.1.0.  There are plenty of users who use Apache Falcon along with HDP. One of the largest users of Apache Falcon has used it along with CDH 3 and CDH 4, and there are some users who have tried using Apache Falcon with MapR distribution as well.

Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella including Apache Lens, Apache Hadoop-core, and Apache Oozie. He has been working with distributed processing systems for over a decade and Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from University of Southern California.

Read more »

Data pipelines with Apache Falcon

Over the past few months, Apache Falcon has been gaining traction as a data governance engine for defining, scheduling, and monitoring data management policies. Apache Falcon is a feed processing and feed management system aimed at making it easier for Hadoop administrators to define their data pipelines and auto-generate workflows in Apache Oozie.

HadoopSphere interacted with Srikanth Sundarrajan, VP of Apache Falcon, to understand the usage and functional intricacies of the product. This is what Srikanth told us:

What is the objective of Apache Falcon and what use cases does it fit in?

Apache Falcon is a tool focused on simplifying data and pipeline management for large-scale data, particularly stored and processed through Apache Hadoop. Falcon system provides standard data life cycle management functions such as data replication, eviction, archival while also providing strong orchestration capabilities for pipelines. Falcon maintains dependency information between data elements and processing elements. This dependency information can be used to provide data provenance besides simplifying pipeline management.

Today Apache Falcon system is being used widely for addressing the following use cases:
Declaring and managing data retention policy
Data mirroring, replication, DR backup 
Data pipeline orchestration
Tracking data lineage & provenance
ETL for Hadoop

The original version of Apache Falcon was built back in early 2012. InMobi Technologies, one of the largest users of the system, has been using this as a de-facto platform, for managing their ETL pipelines for reporting & analytics, model training, enforcing data retention policies, data archival and large scale data movement across WAN. Another interesting application where Falcon is being used is to run identical pipelines in multiple sites on local data and results merged in a single location.

How does Apache Falcon work? Can you describe to us major components and their function?

Apache Falcon has three basic entities that it operates with:
- firstly cluster, which represent physical infrastructure, 
- secondly feed which represents a logical data element with periodicity and, 
- thirdly process, which represent a logical processing element that may depend on one or more data elements and may produce one or more data elements. 

At the core of it Apache Falcon maintains two graphs, (1) one is the entity graph and (2) the other is instance graph. The entity graph captures the dependencies between the entities and is maintained in an in-memory structure. The instance graph on the other hand maintains information about instances that have been processed and their dependencies. This is stored on a blueprint compatible graph database.
Falcon Embedded Mode

Falcon system integrates with Apache Oozie for all its orchestration requirements, be it a data life-cycle function or process execution. While control remains with Oozie for all the workflows and its execution, Falcon injects pre and post processing hooks into the flows, allowing Falcon to learn about the execution and completion status of each workflow executed by Oozie. Post processing hook essentially sends a notification via JMS to the Falcon server, which then uses this control signal for supporting other features such as Re-runs and CDC (change data capture).

Falcon has two modes of deployment, embedded and distributed. In Embedded model the Falcon server is complete by itself and provides all the capabilities. In distributed mode the Prism, a lightweight proxy takes center stage and provides a veneer over multiple Falcon instances, which may be run in different geographies. The prism ensures that the falcon entities are in sync across the Falcon server and provides a global view of everything happening on the system.

Falcon Distributed Mode

Falcon system has a REST based interface and a CLI over it. It integrates with Kerberos for providing authentication and minimal authorization capabilities.

Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella including Apache Lens, Apache Hadoop-core, and Apache Oozie. He has been working with distributed processing systems for over a decade and Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from University of Southern California.

Read more »

Top Big Data influencers of 2014

Big Data is an exciting technology space innovating at a pace probably never seen before. With a dynamic ecosystem and scorching pace of product development, it is easy to be left behind. However, thanks to visionaries in this ecosystem who have been able to decode the maze and set things right for us, we have been seeing successful big data use cases and implementations. 

HadoopSphere presents below its annual list of top big data influencers. This list reflects the people, products, organizations and portals that exercised the most influence on big data and ecosystem in a particular year. The influencers have been listed in the following categories:

  • Analysts
  • Online Media
  • Products
  • Social Media
  • Angels
  • Thought Leaders
The info-graphic below shows the Top Big Data influencers of 2014 as ranked by HadoopSphere. 


Mike Gualtieri In 2013, Mike Gualtieri of Forrester had predicted that big data will be the Time person of the year. Well, it almost came true with a big data use (or misuse) case (Edward Snowden) making it to the runner up of Time person of the year. Besides occasionally playing sorcerer, Mike has remained one of the most well respected analyst in year 2014 commanding a comprehensive vison and view for the data ecosystem.

Curt Monash If you have not been reading Curt Monash, you may have been living on an island probably. And, if you got a few incisive comments on your product, well, then you are probably part of an urban elite in this big data city. Don’t expect courtesies, just expect plain honest assessment and that too with technical depth from Curt.

Tony Baer Consistency and clarity are the forte of Tony Baer, Ovum’s principal analyst. Presume a consistent sane advice with clear cut guidance on what to expect and what not to expect from Tony. He has remained a top influencer in big data and Hadoop ecosystem consistently for another year.

Online Media:

TDWI With research papers, blogs, webinars and education events, TDWI continued to attract eye-balls and sponsors alike making it one of the top focused industry portals.

IBM IBM is a technology company but runs a media machinery of its own. Its data initiatives like IBM Data Magazine, IBM Big Data Hub, Big Data University, Developer Works, Red books combined together continued to be among top traffic getters. Though the content may be in part IBM specific, overall it did a great work of educating big data community.

DZone With ‘smart content’ for big data professionals, DZone continued to encourage community to contribute links, articles, guides and ‘refcardz’. DZone ensured both quality and good volume traffic resulting in a high influence on techies.


Spark Do we need to say anything about this obvious choice? Apache Spark has been the flavor of all seasons since 2014 beginning. With biggest open source community in big data ecosystem, it continued to define and influence the shape of future products.

Scala Although technically a programming language and not a product, Scala is listed here as it marches its way ahead to become a preferred language for big data programming. With both Apache Spark and Flink promoting it big time, the simplicity and power of the language became more obvious. We expect Scala to become one of the most powerful languages in few years.

Kafka If you need to quote an example of word-of-mouth success, here it is. Apache Kafka was developed at LinkedIn and was not a part of major Hadoop distributions till early 2015. However, still it has emerged as a preferred choice for data ingestion and has seen adoption by internet companies, financial majors and travel portals among others.

Social Media:

Ben Lorica If one has a dream twitter handle like ‘bigdata’, it may not be sheer co-incidence. It probably shows the handle owner has been talking about big data before we heard of it. Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc and commands the ‘bigdata’ twitter handle with its impressive following.

Gregory Piatetsky-Shapiro As the President of KDnuggets, which provides analytics and data mining consulting, Gregory is a founder of KDD (Knowledge Discovery and Data mining conferences). He is one of the leading social influencers with his mentions generating huge follower interest.

Kirk Borne Dr.Kirk Borne is Professor of Astrophysics and Computational Science at George Mason University. As a data scientist and astrophysicist, he mostly talks about big data on social media and continues to attract huge follower base.

Angel Investors:

Naval Ravikant Entrepreneur and an angel investor, Naval Ravikant is co-founder of AngelList. Through this terrific forum and other offline activities, he has been drumming up the cause of many startups and taking them through the funding gates.

Data Collective DCVC (aka Data Collective) is a seed and early stage venture capital fund that invests in big data companies. Its extended team consists of more than 35 “Equity Partners,” who are notable technical founders and executives, data scientists and engineers. Some of the notable portfolio companies include Blue Data, Continuity, Elasticsearch, Citus Data.

Thought Leaders:

Mike Olson As the Chief Strategy Officer of Cloudera, Mike Olson is a leader whose vision has been driving his company and much of the Hadoop ecosystem. His unbridled passion combined with ability to foresee market dynamics makes him one of the biggest thought leaders and influencers in entire information technology arena. From marketing Hadoop to touting Impala as MPP or mentoring competitive Spark, Mike has exhibited unparalleled transformational leadership characteristics.

Merv Adrian Merv Adrian is Research VP at Gartner and the more known face of the research company in social media and event circles. Each year Gartner somehow manages to hit a rough note with the big data vendors, be it “trough of disillusionment” or “data lake fallacy” comment. However, Merv with his astute knowledge of Hadoop ecosystem, BI world and technology lifecycles has made people understand the discordant notes to apply caution, restrain and intelligence beyond the obvious hype. And, that’s what thought leaders do – create sense and path out of chaos and conflicts. Pro Tip: Merv may not agree with you but will still have you and him understand a common path.

<< Top big data influencers of 2013

Get trained in Hadoop, Spark and Big Data technologies - Enroll now
Read more »

Hadoop High 5 with IBM's Anjul Bhambhri

In the thought leadership series called 'Hadoop High 5', we ask leaders in big data and hadoop ecosystem on the vision and the future path. Continuing this series, we asked Anjul Bhambhri five questions. Anjul Bhambhri is the Vice President of Big Data at IBM. She was previously the Director of IBM Optim application and data life cycle management tools. She is a seasoned professional with over twenty-five years in the database industry. Over this time, Anjul has held various engineering and management positions at IBM, Informix and Sybase. Prior to her assignment in tools, Anjul spearheaded the development of XML capabilities in IBM's DB2 database server. She is a recipient of the YWCA of Silicon Valley's “Tribute to Women in Technology” award for 2009. Anjul holds a degree in Electrical Engineering. 
Click on the questions to read the response.

1. Tell us about your journey with big data and Hadoop so far.

IBM has invested heavily in Hadoop to-date, and we’ll continue to do so. We see it as a key technology that solves a variety of problems for our customers. InfoSphere BigInsights is IBM's distribution of Hadoop. It incorporates 100% standard open-source components, but we’ve included enterprise-grade features, which is very important to our client base. We’ve done this in a way that preserves the openness of the platform, but also provides significant advantages, helping customers deploy solutions faster and more cost-efficiently. Each one of our enterprise grade features is an opt-in for customers. They can choose to remain purely on the open source capabilities if they choose to do so, and we provide support per IBM standard support models. I would like to point out that this support model provides a longer longevity than available from other pure play open source vendors.  

We have also been active contributors in specific projects that are of value to our enterprise customer base like HBase, metadata management, Security, encryption, in addition to a number of bug fixes in various projects of the ecosystem. We have also brought the power of Hadoop to the Power and Z Mainframe platforms.

A good example of our commitment is Big SQL -- IBM’s ANSI compliant SQL on Hadoop. Big SQL leverages more than 30 years of experience in database engineering. Big SQL works natively on Hadoop data sources, and interoperates seamlessly with open source tools. This is a big win for customers since they can use their existing investments in SQL tools, including Cognos, SPSS and other third-party tools. They can gain seamless access to data, regardless of where it’s stored. They don't need to re-write their queries into someone's less-capable dialect of the SQL. This not only saves time and money -- it simplifies the environment and helps customers deploy solutions faster and more reliably.

IBM believes that Hadoop has managed to become the heterogenous compute platform that allows us to run a variety of applications, especially analytics. While a number of initial implementations focused on basic warehouse style processing, data lakes, and optimizations of transformation workloads, they are now graduating to higher level analytics using polystructured data. . New geospatial data is being brought in, for example, to analyze accident proclivity in specific zip codes based on road conditions. Traffic information is being integrated with vehicle wear and tear information to influence insurance rating policies. Such types of analytics require a large amount of both compute and storage and Hadoop has made it possible.

2. One of the common questions is that does IBM Watson complement or substitute Hadoop ecosystem?

IBM Watson specializes in the Cognitive compute, Q&A style analytics and solutions. Hadoop is an essential underpinning for such a system, especially in an enterprise context. Data in enterprises is imperfect, and needs a set of curation and aggregation steps to ensure that the Watson magic can be applied. Information extraction, Entity integration are 2 key elements that go into this curation process. IBM’s distribution of Hadoop, BigInsights, provides comprehensive Text and machine learning capabilities, that are used in this curaton process by Watson. If I'm writing an application to parse human and human call-center conversations, for example, or an application to process social media feeds or log files, I can build and deploy these applications much faster and get better results, because IBM has already done the heavy lifting in its text and machine learning engins.. This means customers can start solving real business problems faster, and glean insights more easily, from their unstructured data.
We see the actual capture and storage of data in Hadoop as the easy part of the problem. Anybody can do this. We're focused on the analytic tooling that can help customers get value out of the information they're capturing.

3. Among the various big data use cases that you have implemented with customers, which one really made you say ‘wow’?

Organizations today are overwhelmed with Big Data. In fact, the world is generating more than 2.5 billion gigabytes of data every day, and 80 percent of it is “unstructured”—everything from images, video and audio to social media and a blizzard of impulses from embedded sensors and distributed devices. Applying analytics to data can help businesses handle the onslaught of information to help make better business decisions, but most importantly, even saves lives.
UCLA, for instance, is relying on breakthrough big data technology to help patients with traumatic brain injuries. At UNC Healthcare, doctors are using a big data and Smarter Care solution to identify high-risk patients, understand in context what’s causing them to be hospitalized, and then take preventative action.
Scientists and engineers at IBM continue to push the boundaries of science and technology with the goal of creating systems that sense, learn, reason and interact with people in new ways. It’s aimed at making the relationship between human and computer more natural so that increasingly, computers will offer advice, rather than waiting for commands.

4. Looking over the horizon, where do you see Hadoop market heading for in next 3 years’ time frame?

The success of open source Hadoop means organizations today can rethink their approach to how they handle information. Rather than taking the time to decide what information to retain and what to discard, they can keep all the data in Hadoop in case it’s need, which is much more cost efficient. With access to more data, management can get a better understanding of customers, their needs and how they use products and services.
This is pretty impressive, considering few people even knew about Hadoop five years ago.

We believe that the focus for Hadoop will shift from data ingest and processing, and related plumbing to interactive information discovery, and collaborative analytics over the large amounts of data it encourages to be stored. We see Spark as a fundamental enabler of a new generation of analytics applications, especially because of its unified programming and application development model, as well as its unique approach to in-memory processing.
Hadoop will continue to be the handle the heavy batch reporting and analytics workloads as well as become the next generation warehousing platform. However, we believe that the combination of Spark and Hadoop executing on the same substrate of storage and compute, will solve the one fundamental problem of conventional warehousing – make it accessible, actionable, and useful for the business users by spawning an entirely new set of analytics applications.
For this to happen, there needs to be a strong tool chain, that enables every business analyst into a data scientist over time. It is also imperative that we have reference architectures and frameworks that allow for standardizations between applications, so that they can exchange and collaborate with one another.

5. Given some super powers for a day, what you would like to do with it?

I have always believed in using technology to bring fundamental change in people’s lives. Big data has the promise to do such changes.
As a woman and a mother, I hold dear to my heart, the education of children all the way through college. With my superpowers, I would create a set of applications and advanced analytics built on Hadoop and Spark that would help teachers to understand the student needs, The set of applications would also help teachers identify drop off indicators from universities, provide better counseling what courses and majors that fit the strength of each individual. The applications will use Watson to help students to attend the schools that best fit them, and the right level of financial assistance so that they can get into the workforce with little debt, if at all.
If I can get all of that done in a day, am sure the next day will have a new dawn!

Read more »

Ciao latency, hallo speed

While Hadoop had long been suspect to higher latency engines due to inherent MapReduce design, the same does not hold true any longer. With advent of faster processing engines for Hadoop, distributed data can now be processed with lower latency and in more efficient manner. We continue our discussion with Stephan Ewen to find out more about Apache Flink for distributed data processing. In the first part of the discussion, we focused on technical aspects of Apache Flink's working. Now we turn our attention to the comparative use and fitment in the overall Hadoop ecosystem. Read below to find out what Ewen has to say. 

How does Apache Flink technically compare to Spark and are there any performance benefits?

Flink and Spark start from different points in the system design space. In the end, it is really a question of finding the right tool for a particular workload. Flink’s runtime has some unique features that are beneficial in certain workloads.

Flink uses data streaming rather than batch processing as much as possible to execute both batch and streaming programs. This means for streaming programs that they are executed in a real streaming fashion, with more flexible windows, lower latency, and long living operators. For batch programs, intermediate data sets are often piped to their consumers as they are created, saving on memory and disk I/O for data sets larger than memory.
Flink is memory-aware, operating on binary data rather than Java objects. This makes heavy data crunching inside the JVM efficient, and alleviates many of the problems that the JVM has for data-intensive workloads.
Flink optimizes programs in a pre-flight stage using a cost-based optimizer rather than eagerly sending programs to the cluster. This may have advantages in performance, and helps the debuggability of the programs.
Flink has dedicated iteration operators in the APIs and in the runtime. These operators support fast iterations and allow the system to be very efficient, for example, in case of graphs.

What lies ahead on the roadmap for Apache Flink in 2015?

The Flink community has recently discussed in the developer mailing list and published a roadmap for 2015. The roadmap includes more libraries and applications on top of Flink (e.g., a graph and a Machine Learning library), support for interactive programs, improvements to streaming functionality, performance, and robustness, as well as integration with other Apache and open source projects. (See here for more details on the roadmap)

In which use cases do you see Apache Flink being a good fit vis-à-vis other ecosystem options?

Flink’s batch programs shine when using data-intensive and compute intensive pipelines - and even more so when including iterative parts. This includes both complex ETL jobs, as well as data intensive machine learning algorithms. Flink’s architecture is designed to combine robustness with the ease of use and performance benefits of modern APIs and in-memory processing. A good example can be recommendation systems for objects like new movies on Netflix, or shopping articles on Amazon.

For data streaming use cases, the newly streaming API (beta status) offers beautiful high-level APIs with flexible windowing semantics, backed by a low-latency execution engine.

Graph algorithms work particularly well on Flink, due to its strong support for (stateful) iterative algorithms. As one of the first major libraries, Flink’s graph library “Gelly” has been added in its first version.

Stephan Ewen is committer and Vice President of Apache Flink and co-founder and CTO of data Artisans, a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Stephan was leading the development of Flink since the early days of the project (then called Stratosphere). Stephan holds a PhD in Computer Science from the University of Technology, Berlin, and has been with IBM Research and Microsoft Research in the course of several internships.

Read more »

Distributed data processing with Apache Flink

A new distributed engine named Apache Flink has been making its presence felt in the Hadoop ecosystem primarily due to its faster processing and expressive coding capabilities. To get a better idea on Apache Flink design and integration, HadoopSphere caught up with PMC chair Stephan Ewen and asked him more about the product. This is the first part of the interaction with Ewen where we ask him on technical aspects of Apache Flink.

How does Apache Flink work?

Flink is a distributed engine for batch and stream data processing. Its design draws inspiration from MapReduce, MPP databases, and general dataflow systems. Flink can work completely independent of existing technologies like Hadoop, but can run on top of HDFS and YARN.

Developers write Flink applications in fluent APIs in Java or Scala, based on parallel collections (data sets / data streams). The APIs are designed to feel familiar for people who know Hadoop, Apache Spark, Google Dataflow or SQL. For the latter, Flink adds a twist of SQL-style syntax to the API, with richer data types (beyond key/value pairs) whose fields can be used like attributes in SQL. Flink’s APIs also contain dedicated constructs for iterative programs, to support graph processing and machine learning efficiently.

Rather than executing the programs directly, Flink transforms them into a parallel data flow - a graph of parallel operators and data streams between those operators. The parallel data flow is executed on a Flink cluster, consisting of distributed worker processes, much like a Hadoop cluster.

Flink’s runtime is designed with both efficiency and robustness in mind, integrating various techniques in a unique blend of data pipelining, custom memory management, native iterations, and a specialized serialization framework. The runtime is able to exploit the benefits of in-memory processing, while being robust and efficient with memory and shuffles.

For batch programs, Flink additionally applies a series of optimizations while transforming the user programs to parallel data flow, using techniques from SQL databases, applied to Java/Scala programs. The optimizations are designed to reduce data shuffling or sorting, or automatically cache constant data sets in memory, for iterative programs.

What are the integration options available with Apache Flink?

Unlike MapReduce programs, Flink programs are embedded into regular Java/Scala programs, making them more flexible in interacting with other programs and data sources.

As an example, it is possible to define a program that starts with some HDFS files and joins in a SQL (JDBC) data source. Inside the same program, one can switch from the functional data set API to a graph processing paradigm (vertex centric) to define some computations, and then switch back to the functional paradigm for some the next steps. 

The Flink community is also currently working on integrating SQL statements into the functional API.

Stephan Ewen is committer and Vice President of Apache Flink and co-founder and CTO of data Artisans, a Berlin-based company that is developing and contributing to Apache Flink. Before founding data Artisans, Stephan was leading the development of Flink since the early days of the project (then called Stratosphere). Stephan holds a PhD in Computer Science from the University of Technology, Berlin, and has been with IBM Research and Microsoft Research in the course of several internships.

Read more »