Over the past few months, Apache Falcon has been gaining traction as a data governance engine for defining, scheduling, and monitoring data management policies. Apache Falcon is a feed processing and feed management system aimed at making it easier for Hadoop administrators to define their data pipelines and auto-generate workflows in Apache Oozie.
HadoopSphere interacted with Srikanth Sundarrajan, VP of Apache Falcon, to understand the usage and functional intricacies of the product. This is what Srikanth told us:
What is the objective of Apache Falcon and what use cases does it fit in?
Apache Falcon is a tool focused on simplifying data and pipeline management for large-scale data, particularly stored and processed through Apache Hadoop. Falcon system provides standard data life cycle management functions such as data replication, eviction, archival while also providing strong orchestration capabilities for pipelines. Falcon maintains dependency information between data elements and processing elements. This dependency information can be used to provide data provenance besides simplifying pipeline management.
Today Apache Falcon system is being used widely for addressing the following use cases:
• Declaring and managing data retention policy
• Data mirroring, replication, DR backup
• Data pipeline orchestration
• Tracking data lineage & provenance
• ETL for Hadoop
The original version of Apache Falcon was built back in early 2012. InMobi Technologies, one of the largest users of the system, has been using this as a de-facto platform, for managing their ETL pipelines for reporting & analytics, model training, enforcing data retention policies, data archival and large scale data movement across WAN. Another interesting application where Falcon is being used is to run identical pipelines in multiple sites on local data and results merged in a single location.
How does Apache Falcon work? Can you describe to us major components and their function?
Apache Falcon has three basic entities that it operates with:
- firstly cluster, which represent physical infrastructure,
- secondly feed which represents a logical data element with periodicity and,
- thirdly process, which represent a logical processing element that may depend on one or more data elements and may produce one or more data elements.
At the core of it Apache Falcon maintains two graphs, (1) one is the entity graph and (2) the other is instance graph. The entity graph captures the dependencies between the entities and is maintained in an in-memory structure. The instance graph on the other hand maintains information about instances that have been processed and their dependencies. This is stored on a blueprint compatible graph database.
Falcon system integrates with Apache Oozie for all its orchestration requirements, be it a data life-cycle function or process execution. While control remains with Oozie for all the workflows and its execution, Falcon injects pre and post processing hooks into the flows, allowing Falcon to learn about the execution and completion status of each workflow executed by Oozie. Post processing hook essentially sends a notification via JMS to the Falcon server, which then uses this control signal for supporting other features such as Re-runs and CDC (change data capture).
Falcon has two modes of deployment, embedded and distributed. In Embedded model the Falcon server is complete by itself and provides all the capabilities. In distributed mode the Prism, a lightweight proxy takes center stage and provides a veneer over multiple Falcon instances, which may be run in different geographies. The prism ensures that the falcon entities are in sync across the Falcon server and provides a global view of everything happening on the system.
|Falcon Distributed Mode|
Falcon system has a REST based interface and a CLI over it. It integrates with Kerberos for providing authentication and minimal authorization capabilities.