
Governance in a data lake

A robust data governance layer is fast becoming an essential requirement for an enterprise data lake. Continuing our discussion on data governance, we focus on Apache Falcon as a solution option for governing data pipelines. HadoopSphere spoke with Srikanth Sundarrajan, VP of Apache Falcon, about the product as well as about data governance requirements. In the first part of the interview, we talked about Falcon's architecture. We discuss the functional aspects further in the interaction below.

What lies ahead on the roadmap of Apache Falcon for 2015?

Major focus areas for Apache Falcon in 2015 and beyond:
Entity management and instance administration dashboard – The current CLI-based administration is very limiting, and the real power of the dependency information available within Falcon can't be unlocked without an appropriate visual interface. Entity management complexity can also be cut down through a friendlier UI.
Recipes – Today Falcon supports the notion of a process that performs some action over data. But there are standard, routine operations that are applicable to a wide range of users. The Falcon project is currently working on enabling these through the notion of a recipe. Recipes will let users convert their standard data routines into templates for reuse, and, more importantly, common templates can be shared across users and organizations (see the first sketch after this list).
Life cycle – Falcon supports standard data management functions, such as retention and replication, off the shelf; however, these don't cater to every user's requirements and may need customization. The Falcon team is currently working on opening this up, allowing the life cycle to be customized per deployment to cater to the specific needs of each user.
Operational simplification – Where Falcon becomes the de facto platform (as is the case with some users), the richness of the dependency information it holds can be leveraged to simplify how data processing is managed operationally. Today, infrastructure outages, maintenance, degradation, or application failures can stall large pipelines and cause cascading issues. The dependency information in Falcon can be used to recover from these seamlessly, without any manual intervention (see the second sketch after this list).
Pipeline designer – This is a forward-looking capability in Falcon that enables big data ETL pipelines to be authored visually. The designer would generate code in a language such as Apache Pig, wrap it in an appropriate Falcon process, and define the appropriate feeds.
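
To make the process and recipe ideas concrete, here is a minimal sketch of how a recipe-style template might be rendered into a Falcon process entity and submitted over Falcon's REST API. The entity XML follows Falcon's documented process schema; the server URL, cluster, feed names, and script path are hypothetical, and the snippet assumes Falcon's default pseudo authentication (the user.name query parameter).

    import requests
    from string import Template

    # A recipe-style template: a standard data routine with the
    # site-specific details left as placeholders.
    PROCESS_TEMPLATE = Template("""\
    <process name="${name}" xmlns="uri:falcon:process:0.1">
      <clusters>
        <cluster name="${cluster}">
          <validity start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"/>
        </cluster>
      </clusters>
      <parallel>1</parallel>
      <order>FIFO</order>
      <frequency>hours(1)</frequency>
      <inputs>
        <input name="in" feed="${input_feed}" start="now(0,0)" end="now(0,0)"/>
      </inputs>
      <outputs>
        <output name="out" feed="${output_feed}" instance="now(0,0)"/>
      </outputs>
      <workflow engine="pig" path="${script_path}"/>
      <retry policy="periodic" delay="minutes(10)" attempts="3"/>
    </process>
    """)

    # Render the template for one concrete pipeline (all values hypothetical).
    entity_xml = PROCESS_TEMPLATE.substitute(
        name="click-aggregation",
        cluster="primary-cluster",
        input_feed="raw-clicks",
        output_feed="hourly-click-summary",
        script_path="/apps/pipelines/aggregate.pig",
    )

    # Submit the rendered entity to the Falcon server's REST API.
    FALCON_URL = "http://falcon.example.com:15000"  # hypothetical endpoint
    resp = requests.post(
        FALCON_URL + "/api/entities/submit/process",
        params={"user.name": "etl-user"},
        headers={"Content-Type": "text/xml"},
        data=entity_xml,
    )
    print(resp.status_code, resp.text)

A real recipe would also template the input and output feeds and carry its own properties file; the point here is only that a routine pipeline reduces to a handful of site-specific parameters.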
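To illustrate the operational angle: today, recovering a stalled pipeline after an outage typically means rerunning the affected instances by hand, which is exactly the step the dependency information could automate. A minimal sketch, assuming Falcon's documented instance rerun endpoint and hypothetical entity names and time window:

    import requests

    FALCON_URL = "http://falcon.example.com:15000"  # hypothetical endpoint

    # Rerun the instances of a process that failed inside an outage
    # window; once they succeed, downstream consumers of its output
    # feeds pick up again without further intervention.
    resp = requests.post(
        FALCON_URL + "/api/instance/rerun/process/click-aggregation",
        params={
            "user.name": "etl-user",
            "start": "2015-03-01T00:00Z",
            "end": "2015-03-01T06:00Z",
        },
    )
    print(resp.status_code, resp.text)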

Can you elaborate on the key desired components of big data governance, regardless of tool capabilities, at this stage?

Security, Quality, Provenance, and Privacy are fundamental when it comes to data governance:
Quality – The quality of data is one of its most critical aspects; there have to be convenient ways both to audit the system for data quality and to build proactive mechanisms that cut out any sources of inaccuracy.
Provenance – Organizations typically have complex data flows, and it is often challenging to figure out the lineage of the data. Being able to trace lineage at the dataset level, field level, and record level (in that order of importance) is essential.
Security – This is fundamental hygiene for any data system. Authentication, authorization, and an audit trail are non-negotiable: every user has to be authenticated, and all access to data has to be authorized and audited.
Privacy – Data anonymization is one of the key techniques for conforming to the laws and regulations of the land. This is something that data systems have to natively support or enable (a minimal sketch follows this list).
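
To illustrate the privacy point, here is a small, hypothetical sketch, independent of Falcon, of field-level anonymization at publish time: PII columns are replaced with keyed one-way hashes, so records remain joinable on the pseudonymous key while the raw identifiers never reach downstream consumers. The field names and key handling are assumptions for illustration.

    import csv
    import hashlib
    import hmac

    # Hypothetical example: pseudonymize PII fields with a keyed one-way
    # hash (HMAC-SHA256) so records stay joinable but cannot be
    # re-identified without the secret key.
    SECRET_KEY = b"rotate-me-regularly"   # in practice, from a key store
    PII_FIELDS = {"user_id", "email"}     # columns to anonymize

    def pseudonymize(value: str) -> str:
        return hmac.new(SECRET_KEY, value.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    def anonymize_file(src_path: str, dst_path: str) -> None:
        with open(src_path, newline="") as src, \
             open(dst_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                # Replace each PII column present in this row.
                for field in PII_FIELDS & set(row):
                    row[field] = pseudonymize(row[field])
                writer.writerow(row)

    # anonymize_file("raw_events.csv", "published_events.csv")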

Why would an enterprise prefer the open source Apache Falcon over commercial tools (like Informatica)?

Apache Falcon is a Hadoop-first data management system and integrates well with the widely adopted, standard components of the open source big data ecosystem. This native integration with Hadoop is what makes it a tool of choice. Being available under the liberal Apache License 2.0 and housed under the ASF allows users and organizations to experiment with Falcon easily and also enables them to contribute their extensions back. The recent elevation of Apache Falcon to a top-level project further assures users of the community-driven development process adopted within the Falcon project.

If someone is using the Cloudera distribution, what are their options?

Apache Falcon is distribution agnostic and should work (with some minor tweaks) for anyone using Apache Hadoop 2.5.0 and above along with Oozie 4.1.0. There are plenty of users who run Apache Falcon with HDP. One of the largest users of Apache Falcon has run it with CDH 3 and CDH 4, and some users have tried Apache Falcon with the MapR distribution as well.
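
When trying Falcon on a new distribution, a quick post-deployment sanity check is to probe the server's admin version endpoint. A minimal sketch, assuming Falcon's documented admin REST API, pseudo authentication, and a hypothetical host:

    import requests

    FALCON_URL = "http://falcon.example.com:15000"  # hypothetical endpoint

    # Ask the Falcon server for its build version; a successful response
    # is a quick sanity check that the server started cleanly against
    # this cluster's Hadoop and Oozie versions.
    resp = requests.get(
        FALCON_URL + "/api/admin/version",
        params={"user.name": "etl-user"},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.text)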


Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next-generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella, including Apache Lens, Apache Hadoop core, and Apache Oozie. He has been working with distributed processing systems for over a decade, and with Hadoop in particular for the last 7 years. Srikanth holds a graduate degree in Computer Engineering from the University of Southern California.

