Governance in a data lake

A robust data governance layer is becoming an essential requirement for an enterprise data lake. Continuing our discussion on data governance, we focus on Apache Falcon as a solution option for governing data pipelines. HadoopSphere spoke with Srikanth Sundarrajan, VP of Apache Falcon, about the product as well as data governance requirements. In the first part of the interview, we talked about Falcon's architecture. In this part, we discuss the functional aspects.

What lies ahead on the roadmap of Apache Falcon for 2015?

Major focus areas for Apache Falcon in 2015 and beyond:
Entity management and instance administration dashboard – Currently, CLI-based administration is very limiting, and the real power of the dependency information available within Falcon can’t be unlocked without an appropriate visual interface. Also, entity management complexities can be cut down through a friendlier UI.
Recipes – Today Falcon supports the notion of a process to perform some action over data. But there are standard and routine operations that may be applicable to a wide range of users. The Falcon project is currently working on enabling this through the notion of a recipe. This will enable users to convert their standard data routines into templates for reuse and, more importantly, allow some common templates to be shared across users and organizations.
Life cycle – Falcon supports standard data management functions off the shelf; however, these don’t cater to every user’s requirements and might require customization. The Falcon team is currently working on opening this up so that it can be customized per deployment to meet a user's specific needs.
Operational simplification – When Falcon becomes the de facto platform (as is the case with some of the users), the rich dependency information it holds can be leveraged to simplify how data processing is managed operationally. Today, infrastructure outages, maintenance, degradation or application failures can stall large pipelines and cause cascading issues. Dependency information in Falcon can be used to recover from these seamlessly without any manual intervention.
Pipeline designer – This is a forward-looking capability in Falcon that enables big data ETL pipelines to be authored visually. It would generate code in a language such as Apache Pig, wrap it in an appropriate Falcon process and define the appropriate feeds.
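
To make the operational simplification point concrete, here is a minimal Python sketch of how dependency information could drive recovery: given a failed entity, it finds everything downstream and returns it in an order that is safe to rerun. This is not Falcon's API or implementation. The `DEPENDS_ON` mapping, the entity names and the `affected_by` function are all hypothetical, and the sketch assumes Python 3.9 or later for `graphlib`.

```python
from collections import defaultdict, deque
from graphlib import TopologicalSorter

# Hypothetical dependency data: each entity maps to the entities it consumes.
# These names are invented for illustration; they are not real Falcon entities.
DEPENDS_ON = {
    "clean_clicks": {"raw_clicks"},
    "clean_users": {"raw_users"},
    "sessionize": {"clean_clicks"},
    "daily_report": {"sessionize", "clean_users"},
}


def affected_by(failed, depends_on):
    """Return every entity downstream of `failed`, in a safe rerun order."""
    # Invert the mapping so we can walk from a producer to its consumers.
    consumers = defaultdict(set)
    for entity, inputs in depends_on.items():
        for upstream in inputs:
            consumers[upstream].add(entity)

    # Breadth-first walk collects everything reachable from the failure.
    affected, queue = set(), deque([failed])
    while queue:
        for consumer in consumers[queue.popleft()]:
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)

    # Keep only edges between affected entities, then order them so that
    # each entity is rerun after the affected inputs it depends on.
    subgraph = {e: depends_on.get(e, set()) & affected for e in affected}
    return list(TopologicalSorter(subgraph).static_order())
```

With the sample data, `affected_by("raw_clicks", DEPENDS_ON)` returns `['clean_clicks', 'sessionize', 'daily_report']`, while `clean_users` is left alone because it does not depend on the failed input. A real scheduler would also have to track which instances (time windows) of each entity were affected, which this sketch ignores.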

Can you elaborate on key desired components of big data governance regardless of tool capabilities at this stage?

Security, Quality, Provenance and Privacy are fundamental when it comes to data governance:
Quality – Quality of data is one of the most critical components, and there have to be convenient ways both to audit the system for data quality and to build proactive mechanisms that cut out any sources of inaccuracy.
Provenance – Organizations typically have complex data flows, and it is often challenging to figure out the lineage of this data. Being able to get this lineage at a dataset level, field level and record level (in that order of importance) is very important.
Security – This is fundamental, a matter of basic hygiene for any data system. Authentication, authorization and an audit trail are non-negotiable. Every user has to be authenticated, and all access to data has to be authorized and audited.
Privacy – Data anonymization is one of the key techniques for conforming to the laws and regulations of the land. This is something that data systems have to natively support or enable.
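
As one illustration of the privacy point, the following Python sketch replaces sensitive fields with a keyed hash (HMAC-SHA256) so that records remain joinable without exposing the raw identifier. Strictly speaking this is pseudonymization, a weaker guarantee than full anonymization, and it is only one of several possible techniques. Nothing here is a Falcon feature: the function names, field names and key handling are assumptions made for the example.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of `value` as a hex string."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def anonymize_record(record: dict, sensitive_fields: set, secret_key: bytes) -> dict:
    """Return a copy of `record` with the sensitive fields pseudonymized."""
    return {
        field: pseudonymize(str(value), secret_key)
        if field in sensitive_fields
        else value
        for field, value in record.items()
    }


# Hypothetical usage: in practice the key would come from a secrets store.
if __name__ == "__main__":
    key = b"example-key-do-not-use"
    row = {"user_id": "u42", "country": "IN", "clicks": 7}
    print(anonymize_record(row, {"user_id"}, key))
```

Because the same key always yields the same digest for the same input, datasets processed with one key can still be joined on the pseudonymized field, while rotating or destroying the key breaks that link.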

Why would an enterprise prefer open source Apache Falcon over commercial tools (like Informatica)?

Apache Falcon is a Hadoop-first data management system and integrates well with the widely adopted standard components of the open source big data ecosystem. This native integration with Hadoop is what makes it a tool of choice. Because Apache Falcon is available under the liberal Apache License 2.0 and housed under the ASF, users and organizations can experiment with it easily and contribute their extensions. The recent elevation of Apache Falcon to a top-level project also assures users of the community-driven development process adopted within the Falcon project.

If someone is using the Cloudera distribution, what are their options?

Apache Falcon is distribution agnostic and should work (with some minor tweaks) for anyone using Apache Hadoop 2.5.0 and above along with Oozie 4.1.0. There are plenty of users who use Apache Falcon along with HDP. One of the largest users of Apache Falcon has used it along with CDH 3 and CDH 4, and there are some users who have tried using Apache Falcon with the MapR distribution as well.

Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next generation data management system. He is one of the key contributors to Apache Falcon and is currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella including Apache Lens, Apache Hadoop-core, and Apache Oozie. He has been working with distributed processing systems for over a decade and with Hadoop in particular for the last 7 years. Srikanth holds a graduate degree in Computer Engineering from the University of Southern California.

