
Data pipelines with Apache Falcon

Over the past few months, Apache Falcon has been gaining traction as a data governance engine for defining, scheduling, and monitoring data management policies. Apache Falcon is a feed processing and feed management system aimed at making it easier for Hadoop administrators to define their data pipelines and auto-generate workflows in Apache Oozie.

HadoopSphere interacted with Srikanth Sundarrajan, VP of Apache Falcon, to understand the usage and functional intricacies of the product. This is what Srikanth told us:

What is the objective of Apache Falcon and what use cases does it fit in?


Apache Falcon is a tool focused on simplifying data and pipeline management at large scale, particularly for data stored and processed on Apache Hadoop. The Falcon system provides standard data life-cycle management functions such as replication, eviction, and archival, while also providing strong orchestration capabilities for pipelines. Falcon maintains dependency information between data elements and processing elements; this dependency information can be used to provide data provenance, besides simplifying pipeline management.

Today the Apache Falcon system is widely used to address the following use cases:
- Declaring and managing data retention policies (a sketch of such a feed definition follows this list)
- Data mirroring, replication, and DR backup
- Data pipeline orchestration
- Tracking data lineage and provenance
- ETL for Hadoop
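
Declaring a retention policy, for instance, comes down to describing the feed to Falcon and letting it schedule the eviction. The sketch below is illustrative rather than taken from the interview: the host, port, entity names, and field values are assumptions, and both the feed schema and the REST endpoint should be checked against the Falcon release you run.

```python
import requests

FALCON_URL = "http://falcon-host:15000"   # assumed non-TLS Falcon endpoint; adjust to your deployment
USER = "falcon"                           # assumed simple (pseudo) authentication user

# A minimal feed definition with a 90-day retention policy (illustrative values).
feed_xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed name="clicks-feed" description="hourly click logs" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
"""

# Submit the feed through Falcon's REST API; scheduling it afterwards
# (an analogous call to /api/entities/schedule/feed/clicks-feed) is what
# makes Falcon generate the underlying Oozie retention workflow.
resp = requests.post(
    f"{FALCON_URL}/api/entities/submit/feed",
    params={"user.name": USER},
    headers={"Content-Type": "text/xml"},
    data=feed_xml,
)
print(resp.status_code, resp.text)
```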

The original version of Apache Falcon was built back in early 2012. InMobi Technologies, one of the largest users of the system, has been using it as a de facto platform for managing its ETL pipelines for reporting and analytics, model training, enforcement of data retention policies, data archival, and large-scale data movement across the WAN. Another interesting application of Falcon is running identical pipelines in multiple sites on local data and merging the results in a single location.

How does Apache Falcon work? Can you describe the major components and their functions?


Apache Falcon operates on three basic entities:
- cluster, which represents physical infrastructure;
- feed, which represents a logical data element with a periodicity; and
- process, which represents a logical processing element that may depend on one or more data elements and may produce one or more data elements (a sketch of a process definition follows this list).
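
To make the dependency model concrete, here is a hedged sketch of a process definition: the names, paths, and exact element order are assumptions to validate against the process schema shipped with your Falcon version. The process consumes instances of one feed, produces another, and points at the Oozie workflow that does the actual work, which is precisely the information Falcon needs to build its dependency graphs.

```python
# Illustrative process entity: it runs on a cluster, consumes the clicks-feed
# defined earlier, produces a (hypothetical) clicks-summary feed, and delegates
# execution to an Oozie workflow. It would be submitted the same way as the
# feed above, via /api/entities/submit/process.
process_xml = """<?xml version="1.0" encoding="UTF-8"?>
<process name="clicks-aggregator" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary-cluster">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <inputs>
    <input name="clicks" feed="clicks-feed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="summary" feed="clicks-summary" instance="now(0,0)"/>
  </outputs>
  <workflow engine="oozie" path="/apps/clicks/workflow.xml"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
</process>
"""
```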

At its core, Apache Falcon maintains two graphs: (1) the entity graph and (2) the instance graph. The entity graph captures the dependencies between entities and is maintained in an in-memory structure. The instance graph, on the other hand, maintains information about the instances that have been processed and their dependencies; it is stored in a Blueprints-compatible graph database.
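
Because the instance graph sits in a Blueprints-compatible store, lineage can also be queried over Falcon's metadata REST API. The snippet below is a sketch under assumptions: the /api/metadata/lineage endpoints, the instance vertex name, and the Rexster-style response fields (results, _id, name) are my reading of the Falcon REST documentation and should be verified against your deployment.

```python
import requests

FALCON_URL = "http://falcon-host:15000"   # assumed; point at the Prism in a distributed setup
USER = "falcon"

# Look up the vertex recorded for one process instance, then walk its
# neighbours to see which feed instances it consumed and produced.
verts = requests.get(
    f"{FALCON_URL}/api/metadata/lineage/vertices",
    params={"key": "name",
            "value": "clicks-aggregator/2015-01-01T01:00Z",  # hypothetical instance id
            "user.name": USER},
).json()

for vertex in verts.get("results", []):
    edges = requests.get(
        f"{FALCON_URL}/api/metadata/lineage/vertices/{vertex['_id']}/both",
        params={"user.name": USER},
    ).json()
    print(vertex.get("name"), "->", [v.get("name") for v in edges.get("results", [])])
```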
 
[Figure: Falcon embedded mode]

The Falcon system integrates with Apache Oozie for all its orchestration requirements, be it a data life-cycle function or process execution. While control over all the workflows and their execution remains with Oozie, Falcon injects pre- and post-processing hooks into the flows, allowing Falcon to learn about the execution and completion status of each workflow executed by Oozie. The post-processing hook sends a notification via JMS to the Falcon server, which then uses this control signal to support other features such as re-runs and CDC (change data capture).
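
Those post-processing notifications are ordinary JMS messages on the ActiveMQ broker Falcon is configured with, so other systems can subscribe to them as well. The sketch below is assumption-laden: it presumes the broker exposes a STOMP connector on port 61613 (Falcon itself speaks JMS/OpenWire) and that the notifications land on the FALCON.ENTITY.TOPIC topic; check both against your installation.

```python
import time
import stomp  # third-party STOMP client (pip install stomp.py)

class FalconListener(stomp.ConnectionListener):
    # Newer stomp.py versions pass a single frame; older ones use (headers, message).
    def on_message(self, frame):
        print("workflow status update:", frame.body)

# Connect to the (assumed) STOMP connector of the ActiveMQ broker Falcon uses.
conn = stomp.Connection([("activemq-host", 61613)])
conn.set_listener("", FalconListener())
conn.connect(wait=True)
conn.subscribe(destination="/topic/FALCON.ENTITY.TOPIC", id="falcon-watch", ack="auto")

time.sleep(60)  # keep the process alive long enough to receive a few messages
conn.disconnect()
```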

Falcon has two modes of deployment: embedded and distributed. In embedded mode the Falcon server is complete by itself and provides all the capabilities. In distributed mode the Prism, a lightweight proxy, takes center stage and provides a veneer over multiple Falcon instances, which may run in different geographies. The Prism ensures that the Falcon entities are in sync across the Falcon servers and provides a global view of everything happening on the system.

[Figure: Falcon distributed mode]

The Falcon system has a REST-based interface and a CLI over it. It integrates with Kerberos to provide authentication and minimal authorization capabilities.
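
As an example, a client holding a valid Kerberos ticket (via kinit) can list entities over the REST interface using SPNEGO negotiation; in a distributed deployment the same call would be pointed at the Prism. The URL, port, and CA bundle below are assumptions about a TLS-enabled, Kerberos-secured installation.

```python
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL  # pip install requests-kerberos

FALCON_URL = "https://falcon-host:15443"   # assumed TLS endpoint (or the Prism URL)

# List feed entities; the calling user's Kerberos ticket cache is used for SPNEGO.
resp = requests.get(
    f"{FALCON_URL}/api/entities/list/feed",
    headers={"Accept": "application/json"},
    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
    verify="/etc/security/falcon-ca.pem",  # assumed CA bundle for the server certificate
)
resp.raise_for_status()
print(resp.text)
```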



Srikanth Sundarrajan works at InMobi Technology Services, helping architect and build their next-generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella, including Apache Lens, Apache Hadoop core, and Apache Oozie. He has been working with distributed processing systems for over a decade, and with Hadoop in particular over the last seven years. Srikanth holds a graduate degree in Computer Engineering from the University of Southern California.


Comments

  1. Hello,

    Concerning the feature: data mirroring, replication, DR backup

    Is this possible between two Hadoop clusters, or only within the same Hadoop cluster?

