Skip to main content

Governance in a data lake

The need for defining a robust data governance layer is becoming an essential requirement for an enterprise data lake. Continuing our discussion on data governance, we focus on Apache Falcon as a solution option for governing the data pipelines. HadoopSphere discussed with Srikanth Sundarrajan, VP of Apache Falcon, about the product as well as the data governance requirements. In the first part of the interview, we talked about Falcon's architecture. We further discuss the functional aspects in the interaction below. 

What lies ahead on the roadmap of Apache Falcon for 2015?

Major focus areas for Apache Falcon in 2015 and beyond:
Entity management and Instance administration dashboard – Currently CLI based administration is very limiting and the real power of the dependency information available within Falcon can’t be unlocked without an appropriate visual interface. Also entity management complexities can be cut down through a friendlier UI.
Recipes – Today Falcon supports notion of a process to perform some action over data. But there are standard and routine operations that may be applicable for a wide range of users. Falcon project is currently working on enabling this through the notion of recipe. This will enable users to convert their standard data routines into templates for reuse and more importantly some common templates can be shared across users/organizations.
Life cycle – Falcon supports standard data management functions off the shelf, however the same doesn’t cater to every user’s requirement and might require customization. Falcon team is currently working on opening this up and allowing this to be customized per deployment to cater to specific needs of a user.
Operational simplification – When Falcon becomes the de-facto platform (as is the case with some of the users), the richness of dependency information contained can be leveraged to operationally simplify how data processing is managed. Today handling infrastructure outage/maintenance or degradation or application failures can stall large pipelines causing cascading issues. Dependency information in Falcon can be used to seamlessly recover from these without any manual intervention.
Pipeline designer – This is a forward-looking capability in Falcon that enables big data ETL pipelines to be authored visually. This would generate code in language such as Apache Pig and wrap them in appropriate Falcon process and define appropriate feeds.

Can you elaborate on key desired components of big data governance regardless of tool capabilities at this stage?

Security, Quality, Provenance and Privacy are fundamental when it comes to data governance
Quality – Quality of data is one of the most critical components and there has to be convenient ways to both audit the system for data quality and also build proactive mechanism to cut out any sources of inaccuracies
Provenance – Organizations typically have complex data flows and often times it is challenging to figure the lineage of this data. To be able to get this lineage at a dataset level, field level and at a record level (in that order of importance) is very important.
Security – This is fundamental and hygiene to any data system. Authentication, Authorization and Audit trail are non-negotiable. Every user has to be authenticated and all access to data is to be authorized and audited.
Privacy – Data anonymization is one of the key techniques to conform to laws and regulation of the land. This is something that the data systems have to natively support or enable.

Why would an enterprise not prefer to use commercial tools (like Informatica) and rather use open source Apache Falcon?

Apache Falcon is a Hadoop first data management system and integrates well with standard components in the big data open source eco systems that are widely adopted. This native integration with Hadoop is what makes it a tool of choice. Apache Falcon being available under liberal APL 2.0 license and housed under ASF allows users/organizations to experiment with it easily and also enable them to contribute their extensions. Recent elevation of Apache Falcon to a top-level project also assures the users about the community driven development process adopted within the Falcon project.

If someone is using Cloudera distribution, what are the options for him?

Apache Falcon is distribution agnostic and should work (with some minor tweaks) for anyone using Apache Hadoop 2.5.0 and above along with Oozie 4.1.0.  There are plenty of users who use Apache Falcon along with HDP. One of the largest users of Apache Falcon has used it along with CDH 3 and CDH 4, and there are some users who have tried using Apache Falcon with MapR distribution as well.


Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella including Apache Lens, Apache Hadoop-core, and Apache Oozie. He has been working with distributed processing systems for over a decade and Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from University of Southern California.


Comments

Popular posts from this blog

In-memory data model with Apache Gora

Open source in-memory data model and persistence for big data framework Apache Gora™ version 0.3, was released in May 2013. The 0.3 release offers significant improvements and changes to a number of modules including a number of bug fixes. However, what may be of significant interest to the DynamoDB community will be the addition of a gora-dynamodb datastore for mapping and persisting objects to Amazon's DynamoDB. Additionally the release includes various improvements to the gora-core and gora-cassandra modules as well as a new Web Services API implementation which enables users to extend Gora to any cloud storage platform of their choice. This 2-part post provides commentary on all of the above and a whole lot more, expanding to cover where Gora fits in within the NoSQL and Big Data space, the development challenges and features which have been baked into Gora 0.3 and finally what we have on the road map for the 0.4 development drive.
Introducing Apache Gora Although there are var…

Data deduplication tactics with HDFS and MapReduce

As the amount of data continues to grow exponentially, there has been increased focus on stored data reduction methods. Data compression, single instance store and data deduplication are among the common techniques employed for stored data reduction.
Deduplication often refers to elimination of redundant subfiles (also known as chunks, blocks, or extents). Unlike compression, data is not changed and eliminates storage capacity for identical data. Data deduplication offers significant advantage in terms of reduction in storage, network bandwidth and promises increased scalability.
From a simplistic use case perspective, we can see application in removing duplicates in Call Detail Record (CDR) for a Telecom carrier. Similarly, we may apply the technique to optimize on network traffic carrying the same data packets.
Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and…

Pricing models for Hadoop products

A look at the various pricing models adopted by the vendors in the Hadoop ecosystem. While the pricing models are evolving in this rapid and dynamic market, listed below are some of the major variations utilized by companies in the sphere.
1) Per Node:Among the most common model, the node based pricing mechanism utilizes customized rules for determining pricing per node. This may be as straight forward as pricing per name node and data node or could have complex variants of pricing based on number of core processors utilized by the nodes in the cluster or per user license in case of applications.
2) Per TB:The data based pricing mechanism charges customer for license cost per TB of data. This model usually accounts non replicated data for computation of cost.
3) Subscription Support cost only:In this model, the vendor prefers to give away software for free but charges the customer for subscription support on a specified number of nodes. The support timings and level of support further …