Skip to main content

Data pipelines with Apache Falcon

Over the past few months, Apache Falcon has been gaining traction as a data governance engine for defining, scheduling, and monitoring data management policies. Apache Falcon is a feed processing and feed management system aimed at making it easier for Hadoop administrators to define their data pipelines and auto-generate workflows in Apache Oozie.

HadoopSphere interacted with Srikanth Sundarrajan, VP of Apache Falcon, to understand the usage and functional intricacies of the product. This is what Srikanth told us:

What is the objective of Apache Falcon and what use cases does it fit in?

Apache Falcon is a tool focused on simplifying data and pipeline management for large-scale data, particularly stored and processed through Apache Hadoop. Falcon system provides standard data life cycle management functions such as data replication, eviction, archival while also providing strong orchestration capabilities for pipelines. Falcon maintains dependency information between data elements and processing elements. This dependency information can be used to provide data provenance besides simplifying pipeline management.

Today Apache Falcon system is being used widely for addressing the following use cases:
Declaring and managing data retention policy
Data mirroring, replication, DR backup 
Data pipeline orchestration
Tracking data lineage & provenance
ETL for Hadoop

The original version of Apache Falcon was built back in early 2012. InMobi Technologies, one of the largest users of the system, has been using this as a de-facto platform, for managing their ETL pipelines for reporting & analytics, model training, enforcing data retention policies, data archival and large scale data movement across WAN. Another interesting application where Falcon is being used is to run identical pipelines in multiple sites on local data and results merged in a single location.

How does Apache Falcon work? Can you describe to us major components and their function?

Apache Falcon has three basic entities that it operates with:
- firstly cluster, which represent physical infrastructure, 
- secondly feed which represents a logical data element with periodicity and, 
- thirdly process, which represent a logical processing element that may depend on one or more data elements and may produce one or more data elements. 

At the core of it Apache Falcon maintains two graphs, (1) one is the entity graph and (2) the other is instance graph. The entity graph captures the dependencies between the entities and is maintained in an in-memory structure. The instance graph on the other hand maintains information about instances that have been processed and their dependencies. This is stored on a blueprint compatible graph database.
Falcon Embedded Mode

Falcon system integrates with Apache Oozie for all its orchestration requirements, be it a data life-cycle function or process execution. While control remains with Oozie for all the workflows and its execution, Falcon injects pre and post processing hooks into the flows, allowing Falcon to learn about the execution and completion status of each workflow executed by Oozie. Post processing hook essentially sends a notification via JMS to the Falcon server, which then uses this control signal for supporting other features such as Re-runs and CDC (change data capture).

Falcon has two modes of deployment, embedded and distributed. In Embedded model the Falcon server is complete by itself and provides all the capabilities. In distributed mode the Prism, a lightweight proxy takes center stage and provides a veneer over multiple Falcon instances, which may be run in different geographies. The prism ensures that the falcon entities are in sync across the Falcon server and provides a global view of everything happening on the system.

Falcon Distributed Mode

Falcon system has a REST based interface and a CLI over it. It integrates with Kerberos for providing authentication and minimal authorization capabilities.

Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella including Apache Lens, Apache Hadoop-core, and Apache Oozie. He has been working with distributed processing systems for over a decade and Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from University of Southern California.


  1. Hello,

    Concerning feature : • Data mirroring, replication, DR backup

    It is possible between to hadoop platform clusters or just in the same hadoop cluster ?


Post a Comment

Popular articles

5 online tools in data visualization playground

While building up an analytics dashboard, one of the major decision points is regarding the type of charts and graphs that would provide better insight into the data. To avoid a lot of re-work later, it makes sense to try the various chart options during the requirement and design phase. It is probably a well known myth that existing tool options in any product can serve all the user requirements with just minor configuration changes. We all know and realize that code needs to be written to serve each customer’s individual needs. To that effect, here are 5 tools that could empower your technical and business teams to decide on visualization options during the requirement phase. Listed below are online tools for you to add data and use as playground. 1)      Many Eyes : Many Eyes is a data visualization experiment by IBM Research and the IBM Cognos software group. This tool provides option to upload data sets and create visualizations including Scatter Plot, Tree Ma

Data deduplication tactics with HDFS and MapReduce

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction. Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability. From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets. Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus o

In-memory data model with Apache Gora

Open source in-memory data model and persistence for big data framework Apache Gora™ version 0.3, was released in May 2013. The 0.3 release offers significant improvements and changes to a number of modules including a number of bug fixes. However, what may be of significant interest to the DynamoDB community will be the addition of a gora-dynamodb datastore for mapping and persisting objects to Amazon's DynamoDB . Additionally the release includes various improvements to the gora-core and gora-cassandra modules as well as a new Web Services API implementation which enables users to extend Gora to any cloud storage platform of their choice. This 2-part post provides commentary on all of the above and a whole lot more, expanding to cover where Gora fits in within the NoSQL and Big Data space, the development challenges and features which have been baked into Gora 0.3 and finally what we have on the road map for the 0.4 development drive. Introducing Apache Gora Although