
Data pipelines with Apache Falcon

Over the past few months, Apache Falcon has been gaining traction as a data governance engine for defining, scheduling, and monitoring data management policies. Apache Falcon is a feed processing and feed management system aimed at making it easier for Hadoop administrators to define their data pipelines and auto-generate workflows in Apache Oozie.

HadoopSphere interacted with Srikanth Sundarrajan, VP of Apache Falcon, to understand the usage and functional intricacies of the product. This is what Srikanth told us:

What is the objective of Apache Falcon and what use cases does it fit in?


Apache Falcon is a tool focused on simplifying data and pipeline management for large-scale data, particularly data stored and processed on Apache Hadoop. The Falcon system provides standard data life cycle management functions such as replication, eviction, and archival, while also providing strong orchestration capabilities for pipelines. Falcon maintains dependency information between data elements and processing elements; this dependency information can be used to provide data provenance, besides simplifying pipeline management.

Today the Apache Falcon system is widely used to address the following use cases:
- Declaring and managing data retention policies (a conceptual sketch follows this list)
- Data mirroring, replication, and DR backup
- Data pipeline orchestration
- Tracking data lineage and provenance
- ETL for Hadoop
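For illustration, the retention use case above comes down to evicting data instances that fall outside a configured window; Falcon expresses this declaratively in a feed's retention policy and runs the eviction for you. The following is a minimal, hypothetical Python sketch of the underlying idea (the directory layout, path, and limit are invented for the example):

```python
import os
import shutil
from datetime import datetime, timedelta

# Hypothetical layout: one directory per hourly partition, e.g. /data/clicks/2015-06-01-13
FEED_ROOT = "/data/clicks"
RETENTION = timedelta(days=90)          # analogous to a feed retention limit of days(90)

def evict_expired_partitions(now=None):
    """Delete time partitions that fall outside the retention window (sketch only)."""
    now = now or datetime.now()
    cutoff = now - RETENTION
    for name in sorted(os.listdir(FEED_ROOT)):
        try:
            partition_time = datetime.strptime(name, "%Y-%m-%d-%H")
        except ValueError:
            continue                    # ignore entries that are not time partitions
        if partition_time < cutoff:
            shutil.rmtree(os.path.join(FEED_ROOT, name))

if __name__ == "__main__":
    evict_expired_partitions()
```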

The original version of Apache Falcon was built back in early 2012. InMobi Technologies, one of the largest users of the system, has been using it as a de facto platform for managing its ETL pipelines for reporting and analytics, model training, enforcing data retention policies, data archival, and large-scale data movement across the WAN. Another interesting application of Falcon is running identical pipelines in multiple sites on local data and merging the results in a single location.

How does Apache Falcon work? Can you describe the major components and their functions for us?


Apache Falcon operates on three basic entities (a minimal sketch follows this list):
- cluster, which represents physical infrastructure;
- feed, which represents a logical data element with a periodicity; and
- process, which represents a logical processing element that may depend on one or more data elements and may produce one or more data elements.
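To make the entity model concrete, here is a minimal, hypothetical Python sketch of the three entity types and how they relate; the field names are illustrative and do not mirror Falcon's actual XML entity schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cluster:
    """Physical infrastructure: a Hadoop cluster and the site (colo) it lives in."""
    name: str
    colo: str

@dataclass
class Feed:
    """Logical data element that is produced or consumed with a periodicity."""
    name: str
    frequency: str                                   # e.g. "hours(1)"
    clusters: List[Cluster] = field(default_factory=list)

@dataclass
class Process:
    """Logical processing element that consumes and produces feeds."""
    name: str
    inputs: List[Feed] = field(default_factory=list)
    outputs: List[Feed] = field(default_factory=list)
```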

At its core, Apache Falcon maintains two graphs: (1) the entity graph and (2) the instance graph. The entity graph captures the dependencies between the entities and is maintained in an in-memory structure. The instance graph, on the other hand, maintains information about the instances that have been processed and their dependencies; it is stored in a Blueprints-compatible graph database.
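The entity graph can be pictured as a simple adjacency structure derived from process definitions: a process depends on its input feeds, and its output feeds depend on the process. Below is a self-contained, hypothetical sketch of that idea (the feed and process names are invented; Falcon persists the instance-level counterpart of this in the graph database mentioned above):

```python
from collections import defaultdict

# Hypothetical process definitions: process name -> (input feeds, output feeds)
PROCESSES = {
    "hourly-aggregation": (["clicks"], ["click-report"]),
    "daily-rollup":       (["click-report"], ["click-summary"]),
}

def build_entity_graph(processes):
    """Map each entity to the entities that directly depend on it."""
    graph = defaultdict(set)
    for proc, (inputs, outputs) in processes.items():
        for feed in inputs:
            graph[feed].add(proc)       # the process depends on each input feed
        for feed in outputs:
            graph[proc].add(feed)       # each output feed depends on the process
    return graph

if __name__ == "__main__":
    for entity, dependents in sorted(build_entity_graph(PROCESSES).items()):
        print(entity, "->", sorted(dependents))
```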
 
Falcon Embedded Mode

The Falcon system integrates with Apache Oozie for all its orchestration requirements, be it a data life-cycle function or process execution. While control of the workflows and their execution remains with Oozie, Falcon injects pre- and post-processing hooks into the flows, allowing Falcon to learn about the execution and completion status of each workflow executed by Oozie. The post-processing hook essentially sends a notification via JMS to the Falcon server, which then uses this control signal to support other features such as re-runs and CDC (change data capture).
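As a rough illustration of that control signal, the sketch below subscribes to a notification topic on the message broker over STOMP using the stomp.py library. The broker host, STOMP port, and topic name are assumptions to be adapted to a real deployment (Falcon's broker is typically ActiveMQ, and STOMP has to be enabled on it); this is not Falcon's internal consumer, just a way to observe the completion messages.

```python
import time
import stomp   # pip install stomp.py (sketch assumes stomp.py 8.x)

class WorkflowStatusListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Each message carries the status of a completed workflow instance
        print("workflow notification:", frame.body)

if __name__ == "__main__":
    # Assumed broker location and topic name; adjust to your Falcon/ActiveMQ setup
    conn = stomp.Connection([("falcon-broker.example.com", 61613)])
    conn.set_listener("", WorkflowStatusListener())
    conn.connect(wait=True)
    conn.subscribe(destination="/topic/FALCON.ENTITY.TOPIC", id="1", ack="auto")
    while True:
        time.sleep(60)   # keep the listener alive
```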

Falcon has two modes of deployment: embedded and distributed. In embedded mode, the Falcon server is complete by itself and provides all the capabilities. In distributed mode, Prism, a lightweight proxy, takes center stage and provides a veneer over multiple Falcon instances, which may run in different geographies. Prism ensures that the Falcon entities are in sync across the Falcon servers and provides a global view of everything happening on the system.
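The role of Prism can be pictured as a thin fan-out layer: write operations on entities are forwarded to every participating Falcon server so the definitions stay in sync, while status queries can be aggregated across them. Below is a hypothetical Python sketch of that idea against the entities/submit REST endpoint; the server URLs, port, and user.name parameter (simple auth) are assumptions, and this is not how Prism is actually implemented.

```python
import requests

# Hypothetical Falcon servers in two geographies, kept in sync by a prism-like proxy
FALCON_SERVERS = [
    "http://falcon-us-east.example.com:15000",
    "http://falcon-eu-west.example.com:15000",
]

def submit_everywhere(entity_type: str, entity_xml: str) -> dict:
    """Fan an entity submission out to every Falcon server so definitions stay in sync."""
    results = {}
    for server in FALCON_SERVERS:
        resp = requests.post(
            f"{server}/api/entities/submit/{entity_type}",
            data=entity_xml,
            headers={"Content-Type": "text/xml"},
            params={"user.name": "falcon"},   # assumes simple (non-Kerberos) auth
        )
        results[server] = resp.status_code
    return results

if __name__ == "__main__":
    feed_xml = open("clicks-feed.xml").read()   # hypothetical feed definition file
    print(submit_everywhere("feed", feed_xml))
```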

Falcon Distributed Mode

The Falcon system has a REST-based interface and a CLI built on top of it. It integrates with Kerberos to provide authentication and minimal authorization capabilities.
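For example, a client can call the REST interface directly (the CLI does the same underneath). The sketch below lists entities with SPNEGO/Kerberos authentication via requests and requests-kerberos; the host, port, endpoint path, and TLS settings are assumptions to verify against an actual deployment.

```python
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL   # pip install requests-kerberos

FALCON_URL = "https://falcon.example.com:15443"   # assumed Falcon (or Prism) endpoint

def list_entities(entity_type: str = "feed"):
    """List entities of the given type through Falcon's REST API using a Kerberos ticket."""
    resp = requests.get(
        f"{FALCON_URL}/api/entities/list/{entity_type}",
        auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
        headers={"Accept": "application/json"},
        verify=False,   # sketch only; point this at your CA bundle in practice
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(list_entities("process"))
```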



Srikanth Sundarrajan works at InMobi Technology Services, helping architect and build their next-generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella, including Apache Lens, Apache Hadoop core, and Apache Oozie. He has been working with distributed processing systems for over a decade, and with Hadoop in particular over the last seven years. Srikanth holds a graduate degree in Computer Engineering from the University of Southern California.


