Recent developments...

Large scale graph processing with Apache Hama

Recently, the Apache Hama team released the official 0.7.0 version. According to the release announcement, it brings big improvements to the graph package. In this article, we provide an overview of the newly improved graph package of Apache Hama, along with benchmark results produced by the cloud platform team at Samsung Electronics.


Large-scale datasets are increasingly used in many fields, and graph algorithms are becoming important for analyzing big data. By analyzing graph structure and characteristics, data scientists can predict customer behavior and market trends and support decision making. Currently there is a variety of open source graph analytic frameworks, such as Google’s Pregel[1], Apache Giraph[2], GraphLab[3] and GraphX[4]. These frameworks target computations ranging from classical graph traversal algorithms, to graph statistics such as triangle counting, to complex machine learning algorithms. However, each of these frameworks offers a solution with a different programming model and is targeted at different users. In this article, we introduce Apache Hama[5] and its graph package.

What is Apache Hama?


Apache Hama is a general-purpose Bulk Synchronous Parallel (BSP)[6] computing engine on top of Hadoop. It provides a parallel processing framework for massive scientific and iterative algorithms. Compared with traditional message-passing models, BSP is an easy and flexible programming model, as shown in Figure 1.

Figure 1. Bulk Synchronous Parallel (BSP) Model.

Hama performs a series of supersteps based on BSP. A superstep consists of three stages: local computation, message communication, and barrier synchronization. Hama is suitable for iterative computation because input data that fits in memory can be carried across supersteps. MapReduce, in contrast, must rescan the input data in each iteration and write the output data to a file system such as HDFS. Hence, Hama can solve problems that MapReduce cannot handle easily.


Graph Package of Apache Hama


Apache Hama also provides a graph package which allows users to program applications for graph-parallel computations[7]. The vertex-centric model is suggestive of MapReduce[8] in that users focus on a local action, processing each item independently, and the system composes these actions to lift computation to a large graph dataset. It is easy to implement and has proved useful for many graph algorithms.

Figure 2. Finding the maximum value in a graph example. 
Dotted lines are messages. Grey vertices have voted to halt.

Figure 2 illustrates the concept of graph processing in Hama using a simple example: given a connected graph where each vertex contains a value, the largest value is propagated to every vertex. In each superstep, any vertex that has learned a larger value from its messages sends it to all its neighbors. When no vertices change in a superstep, the algorithm terminates. To learn more about the graph package in Hama, please refer to Apache Hama Programming[7].
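To make the model concrete, below is a minimal sketch of such a vertex program in the style of Hama's vertex-centric Java API. The class is our own illustration, not the example shipped with Hama, and it assumes the method names of the Hama graph package (compute, sendMessageToNeighbors, voteToHalt).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.graph.Vertex;

// Each vertex keeps the largest value it has seen so far. If an incoming
// message carries a larger value, the vertex adopts it and forwards it to
// all neighbors; otherwise it votes to halt.
public class MaxValueVertex extends Vertex<IntWritable, NullWritable, IntWritable> {

  @Override
  public void compute(Iterable<IntWritable> messages) throws IOException {
    boolean changed = false;
    for (IntWritable msg : messages) {
      if (msg.get() > getValue().get()) {
        setValue(new IntWritable(msg.get()));
        changed = true;
      }
    }
    if (getSuperstepCount() == 0 || changed) {
      // Propagate the current maximum to every neighbor.
      sendMessageToNeighbors(getValue());
    } else {
      voteToHalt();
    }
  }
}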


Hama in Action


We benchmarked the recently released version 0.7.0 of Hama on Amazon Web Services Elastic MapReduce (EMR) instances. The PageRank[9] algorithm was used for the benchmark test of the graph package. PageRank is an algorithm that ranks web pages according to their popularity. It calculates the probability that a random walk ends at a particular vertex of a graph. The application computes the PageRank of every vertex in a directed graph iteratively. At every iteration t, each vertex computes the following:
PR_{t+1}(i) = r / N + (1 - r) * Σ_{(j,i) ∈ E} PR_t(j) / d(j)

where r is the probability of a random jump, N is the number of vertices, E is the set of directed edges in the graph, d(j) is the out-degree of vertex j, and PR_t(j) denotes the PageRank of vertex j at iteration t.
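A vertex-centric implementation of this update is straightforward. The sketch below is our illustration in the style of Hama's graph API and its PageRank example; the 0.85 damping factor (so r = 0.15) and the method names are assumptions for the example, not a reproduction of the benchmarked code.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.graph.Vertex;

// Illustrative PageRank vertex: sum the incoming ranks, apply the
// random-jump term, then send the new rank divided by the out-degree
// to every neighbor.
public class PageRankVertex extends Vertex<Text, NullWritable, DoubleWritable> {

  private static final double DAMPING_FACTOR = 0.85; // 1 - r

  @Override
  public void compute(Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstepCount() == 0) {
      // Start from a uniform distribution over all vertices.
      setValue(new DoubleWritable(1.0 / getNumVertices()));
    } else {
      double sum = 0;
      for (DoubleWritable msg : messages) {
        sum += msg.get();
      }
      double randomJump = (1.0 - DAMPING_FACTOR) / getNumVertices();
      setValue(new DoubleWritable(randomJump + DAMPING_FACTOR * sum));
    }
    // Distribute the current rank evenly across outgoing edges.
    int outDegree = getEdges().size();
    if (outDegree > 0) {
      sendMessageToNeighbors(new DoubleWritable(getValue().get() / outDegree));
    }
  }
}

In practice the benchmarked version also checks a convergence condition via aggregators before halting, which is omitted from this sketch.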
We compared the performance of Hama with Giraph, a graph framework used at Facebook. The PageRank algorithm was already implemented in both frameworks. The default input type in Hama is text, while Giraph's is JSON. For a fair comparison, we implemented a JSON input format for Hama. The JSON format is as follows:
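As an illustration (we assume the adjacency-list convention of Giraph's JSON vertex input format here), each line encodes one vertex as [vertexId, vertexValue, [[targetId, edgeWeight], ...]], for example:

[0, 0.0, [[1, 1.0], [2, 1.0]]]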

The datasets were generated using the fastgen example program, which produces random graph datasets. The following command generates the datasets on a Hama cluster:

% bin/hama jar hama-examples-x.x.x.jar gen fastgen -v 1000 -e 100 -of json -o /randomgraph -t 40

The options control the number of vertices (-v) and edges (-e), the output format (-of), the output location (-o), and the number of tasks (-t).

Experimental Results


In this section, we describe our experimental setup, with details about the datasets and the experimental platform. We conducted various experiments with the PageRank algorithm on an EMR cluster, using graph datasets randomly generated with the fastgen example in Hama. The EMR cluster consists of r3.xlarge[10] instances, a memory-optimized instance type with 30GB of RAM and 4 vCPUs. The cluster ran Hadoop 1.0.3 (Amazon Machine Image version 2.4.11), because Giraph did not work well on Hadoop 2.

Hama provides a pure BSP model via its own cluster, and it also works on Hadoop YARN[11] and Apache Mesos[12]. Although Giraph has the same programming model as Hama, it runs as a MapReduce job and therefore requires the MapReduce framework. So Hama ran the PageRank algorithm on its own cluster, while Giraph ran it as a MapReduce application on Hadoop.

To benchmark scalability, we increased the number of machines from 5 to 30 on a fixed dataset with one billion edges (Figure 3). Both frameworks show significant scalability on the same dataset. However, Hama has a much lower execution time than Giraph for the same dataset.


Figure 3. Execution time on the same dataset, depending on the number of machines.

From the computing performance point of view, Figure 4 shows the execution time on the same nodes, depending on the size of the dataset. Hama again outperforms Giraph on the same machines.

The main reason the benchmark shows different performance despite the same programming model is that we use an advanced PageRank implementation, which relies on aggregators to detect the convergence condition, together with the BSP framework's efficient messaging system. Hama uses its own outgoing/incoming message manager instead of Java's built-in queues. It stores messages in serialized form in a set of bundles (or a single bundle) to reduce memory usage and RPC overhead. Unsafe serialization is also used to serialize Vertex and message objects more quickly. Instead of sending each message individually, Hama packages the messages per vertex and sends a single bundled message to each assigned destination node. With these changes, Hama v0.7 achieved a significant improvement in the performance of graph applications.
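As a rough illustration of the bundling idea only (these are not Hama's actual internal classes), the sketch below keeps outgoing messages in serialized form, one buffer per destination peer, so that each peer receives a single payload per superstep instead of many small RPCs.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Writable;

// Conceptual sketch of message bundling: append each outgoing message to a
// serialized buffer keyed by destination peer, then ship one payload per peer.
public class OutgoingMessageBundler {

  private final Map<String, ByteArrayOutputStream> bundles = new HashMap<>();

  public void addMessage(String peerName, Writable message) throws IOException {
    ByteArrayOutputStream buffer =
        bundles.computeIfAbsent(peerName, p -> new ByteArrayOutputStream());
    // Store the message in serialized form to reduce memory usage.
    message.write(new DataOutputStream(buffer));
  }

  // Called once per superstep: one payload (and hence one RPC) per peer.
  public Map<String, byte[]> drainBundles() {
    Map<String, byte[]> payloads = new HashMap<>();
    bundles.forEach((peer, buffer) -> payloads.put(peer, buffer.toByteArray()));
    bundles.clear();
    return payloads;
  }
}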


Figure 4. Execution time on the same machines, depending on the size of the dataset.

Conclusion and future work


In this article, we presented the graph package of Hama. We also benchmarked the graph package and compared Hama's performance with Giraph in terms of computing performance and scalability. The results show that performance and scalability are already satisfactory for graphs with billions of vertices.

However, there is still a lot of room for improvement over the current version. Efficient load balancing, spillable vertex storage, and message serialization are challenging issues. We look forward to adding these features and to seeing our community grow.

-

References

[1] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. "Pregel: a system for large-scale graph processing." Principles of Distributed Computing (PODC), 2009.
[2] Apache Giraph v1.1.0, http://giraph.apache.org/
[3] GraphLab, https://dato.com/products/create/open_source.html
[4] Apache Spark's GraphX, https://spark.apache.org/graphx/
[5] Apache Hama v0.7.0, http://hama.apache.org/
[6] Leslie G. Valiant. "A bridging model for parallel computation." Communications of the ACM, Volume 33, Issue 8, Aug. 1990.
[7] Apache Hama BSP programming, http://people.apache.org/~tjungblut/downloads/hamadocs/ApacheHamaBSPProgrammingmodel_06.pdf
[8] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
[9] Page, Larry, et al. "PageRank: Bringing order to the web." Vol. 72. Stanford Digital Libraries Working Paper, 1997.
[10] Amazon EC2 instance types, http://aws.amazon.com/ec2/instance-types/
[11] Apache Hadoop YARN architecture, https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html
[12] Apache Mesos, http://mesos.apache.org/
[13] Satish, Nadathur, et al. "Navigating the maze of graph analytics frameworks using massive graph datasets." Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
[14] General Dynamics, High Performance Data Analytics, http://gdmissionsystems.com/wp-content/uploads/2015/05/GDMS_White_Paper_20151.pdf


About the author:

Minho Kim is a software engineer at Samsung Electronics in the Cloud Technology Lab. Minho is a committer on the Apache Hama project, a general-purpose BSP computing engine, and has been developing and contributing to Apache Hama. His research interests include deep learning models such as DNNs and CNNs.






Low latency SQL querying on HBase

HBase has emerged as one of the most popular NoSQL databases, offering distributed, versioned, non-relational tables hosted on commodity hardware. However, with a large set of users coming from the relational SQL world, it made sense to bring SQL back into this NoSQL store. With Apache Phoenix, database professionals get a convenient way to query HBase through SQL in a fast and efficient manner. Continuing our discussion with James Taylor, the founder of Apache Phoenix, we focus on the functional aspects of Phoenix in this second part of the interaction.

Although Apache Phoenix started off with a distinct low-latency advantage, have other options like Hive/Impala (integrated with HBase) caught up in terms of performance?


No, these other tools such as Hive and Impala have not invested in improving performance against HBase data, so if anything, Phoenix's advantage has only gotten bigger as our performance improves. See this link for a comparison of Apache Phoenix with Apache Hive and Cloudera Impala.

Apache Phoenix and Cloudera Impala comparison
(Query: select count(1) from table over 1M and 5M rows)

What lies ahead on roadmap for Apache Phoenix in 2015?


Our upcoming 4.4 release introduces a number of new features:  User Defined Functions, UNION ALL support, Spark integration, Query Server to support thin (and eventually non Java) clients, Pherf tool for testing at scale, MR-based index population, and support for HBase 1.0.

We are also actively working on transaction support by integrating with Tephra (http://tephra.io/). If all goes according to plan, we'll release this after our 4.4 release (in 4.5 or 5.0), as this work is pretty far along (check out our txn branch to play around with it).

In parallel with this, we're working on Apache Calcite integration to improve interop with the greater Hadoop ecosystem by plugging into a rich cost-based optimizer framework. IMHO, this is the answer to ubiquitous usage of Phoenix for HBase data across queries that get data from any other Calcite adapter source (RDBMS, Hive, Drill, Kylin, etc.). This will allow a kind of plug-and-play approach, with the push-down being decided based on a common cost model that all these other tools plug into.

Come hear more and see a demo at our upcoming Meetups or at HBaseCon 2015. 


Does Apache Phoenix also talk to HCatalog or is that interaction left off to HBase itself?


Phoenix manages its metadata through a series of internal HBase tables. It has no interaction with HCatalog.


Can Apache Phoenix be connected with BI tools which have traditionally relied on ODBC drivers?


Phoenix can connect with BI tools that support a JDBC driver. However, BI tools that rely on an ODBC driver are more challenging. There's a new thin driver plus query server model that we support in our upcoming 4.4 release which will help, though. This thin driver will open the door for an ODBC driver, achievable by implementing the same protocol that our Java-based thin driver uses (JSON over HTTP).


Which commercial distributions is Apache Phoenix part of? 


Apache Phoenix is available in the Hortonworks HDP distribution. Make sure to let your vendor of choice know that you'd like to see Phoenix included in their distribution as well, as that's what will make it happen.


James Taylor is an architect at salesforce.com in the Big Data Group. He founded the Apache Phoenix project and leads its on-going development efforts. Prior to Salesforce, James worked at BEA Systems on projects such as a federated query processing system and a SQL-based complex event programming platform, and has worked in the computer industry for the past 20+ years at various start-ups. He lives with his wife and two daughters in San Francisco.


SQL on HBase with Apache Phoenix

Apache Phoenix is a powerful tool for implementing a relational layer on top of the NoSQL HBase database. Its low-latency query model combined with SQL gives the tool a strong edge over the HBase API or shell commands. To discover more about the tool and find out how it works, HadoopSphere interacted with James Taylor, founder of Apache Phoenix.

How does Apache Phoenix fit in the crowded SQL on Hadoop space?


Apache Phoenix is specifically meant for accessing HBase data. It's not trying to compete with the Hive and Impalas of the world. We do very specific optimizations for HBase to ensure you get the best possible performance.

What are the complexities that Apache Phoenix abstracts away from the user?


Phoenix abstracts away the same complexities that an RDBMS does when you use SQL versus reading raw bytes from files that represent a page of data on disk. Sure, this is a bit of an exaggeration, but not by a lot. The HBase client APIs are very low level, at the level of byte arrays. There's no support for composite row keys in HBase, for example, so you're left bit-twiddling the start and stop row in your scans. Performance in HBase changes with each version, so in prior versions you'd get a 25% boost by avoiding the use of the ExplicitColumnTracker.

Above and beyond these lower-level details are higher-level features such as secondary indexes (now supporting functional indexes), user defined functions (available in 4.4), and just the overall capability of querying in SQL (with parallel execution) versus writing lots and lots of Java code (with serial execution). Try writing an N-way join with aggregation and union in the HBase client API. And then compare the performance. And then change your data model and see which one takes longer to adapt.
Apache Phoenix architecture

Can you tell us about internal architecture of Apache Phoenix? How does it work?


Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Phoenix has a pretty typical query engine architecture with a parser, compiler, planner, optimizer, and execution engine. We push as much computation as possible into the HBase server which provides plenty of hooks for us to leverage. This helps us achieve our stellar performance.
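For readers new to Phoenix, the client surface is plain JDBC. The sketch below is a minimal, hedged example of issuing such a query from Java; the ZooKeeper quorum ("zk1,zk2,zk3") and the table and columns are placeholders, not part of the interview.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal Phoenix JDBC usage: the SQL below is compiled by Phoenix into
// HBase scans and the results come back as a regular JDBC ResultSet.
public class PhoenixQueryExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
         PreparedStatement stmt = conn.prepareStatement(
             "SELECT host, COUNT(1) FROM web_stat GROUP BY host");
         ResultSet rs = stmt.executeQuery()) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
      }
    }
  }
}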

Taking an example, can you help us understand Apache Phoenix's execution plan?


Take a look at the video from my HBaseCon presentation here to get a good overview (about 15 minutes in). Other good references can be found on the Apache Phoenix website.


James Taylor is an architect at salesforce.com in the Big Data Group. He founded the Apache Phoenix project and leads its on-going development efforts. Prior to Salesforce, James worked at BEA Systems on projects such as a federated query processing system and a SQL-based complex event programming platform, and has worked in the computer industry for the past 20+ years at various start-ups. He lives with his wife and two daughters in San Francisco.



Exploring varied data stores with Apache MetaModel

With the wide prevalence of multiple data stores including relational, NoSQL and unstructured formats, it becomes natural to look for a library which can act as a common exploration and querying connector. Apache MetaModel is one such library which aims to provide a common interface for discovery, metadata exploration and querying of different types of data sources. With MetaModel, the user can currently query CouchDB, MongoDB, HBase, Cassandra, MySQL, Oracle DB and SQL Server among others.

"MetaModel is a library that encapsulates the differences and enhances the capabilities of different data stores. Rich querying abilities are offered to data stores that do not otherwise support advanced querying and a unified view of the data store structure is offered through a single model of the schema, tables, columns and relationships."


HadoopSphere discussed the functional fit and features of Apache MetaModel with Kasper Sorensen, VP of Apache MetaModel. Here is what Kasper had to say about this interesting product.

Please help us understand the purpose of Apache MetaModel and which use cases can Apache MetaModel fit in?


MetaModel was designed to make connectivity across very different types of databases, data file formats and the like possible with just one uniform approach. We wanted an API where you can explore and query a data source using exactly the same codebase regardless of whether the source is an Excel spreadsheet, a relational database or something completely different.

There are quite a lot of frameworks out there that do this. But they more or less have the requirement that you need to map your source to some domain model and then you end up actually querying the domain model, not the data source itself. So if your user is the guy that is interested in the Excel sheet or the database table itself then he cannot directly relate his data with what he is getting from the framework. This is why we used the name ‘MetaModel’ – we present data based on a metadata model of the source, not based on a domain model mapping approach.

How does Apache MetaModel work?


It is important to us that MetaModel has very few infrastructure requirements - so you don't have to use any particular container type or dependency injection framework. MetaModel is plain-Java oriented and if you want to use it you just instantiate objects, call methods etc.

Whenever you want to interact with data in MetaModel you need an object that implements the DataContext interface. A DataContext is a bit like a “Connection” to a database. The DataContext exposes methods to explore the metadata (schemas, tables etc.) as well as to query the actual data. The Query API is preferably type-safe Java, linked to the metadata objects, but you can also express your query as SQL if you like.
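As a hedged illustration of that interface (the file "people.csv" and its columns are made up for the example, and the fluent calls shown are the commonly documented ones), querying a CSV file might look like this:

import java.io.File;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Table;

// Sketch: the same DataContext and query API would apply to a database,
// spreadsheet or NoSQL store behind a different factory method.
public class MetaModelCsvExample {
  public static void main(String[] args) {
    DataContext dataContext =
        DataContextFactory.createCsvDataContext(new File("people.csv"));

    // Metadata exploration: the schema and table derived from the file.
    Table table = dataContext.getDefaultSchema().getTable(0);
    System.out.println(table.getName());

    // Type-safe query built against the metadata model.
    try (DataSet dataSet = dataContext.query()
        .from(table)
        .select("name", "country")
        .where("country").eq("Denmark")
        .execute()) {
      while (dataSet.next()) {
        System.out.println(dataSet.getRow());
      }
    }
  }
}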

Depending on the type of DataContext, we have a few different ways of working. For SQL databases we of course want to delegate more or less the whole query to the database. On other source types we can only delegate some parts of the query to the underlying database. For instance, if you apply a GROUP BY operation to a NoSQL database, then we usually have to do the grouping part ourselves. For that we have a pluggable query engine. Finally some source types, such as CSV files or XML documents, do not have a query engine already and we wrap our own query engine around them.

Some sources can also be updated. We offer a functional interface where you pass an UpdateScript object that performs the required update, when that is possible for the underlying source and with whatever transactional features it may or may not support.

Give us a snapshot of what is Apache MetaModel competing with - both in open source and commercial ecosystem?


I don't think there's anything out there that is really a LOT like MetaModel. But there are obviously some typical frameworks that have a hint of the same.

JPA, Hibernate and so on are similar in the way that they are essentially abstracting away the underlying storage technology. But they are very different in the sense that they are modelled around a domain model, not the data source itself.

LINQ (for .NET) has a lot of similarities with MetaModel. Obviously the platform is different though and the syntax of LINQ is superior to anything we can achieve as being “just” a library. On the plus-side for MetaModel, I believe we have one of the easiest interfaces to implement if you want to make your own adaptor.

What lies ahead on the roadmap for Apache MetaModel in 2015?


We are in a period of taking small steps so that we get a feel of what the community wants. For example we just made a release where we added write/update support for our ElasticSearch module.

So the long-term roadmap is not really set in stone. But we do always want to expand the portfolio of supported data stores. I personally also would like to see MetaModel used in a few other Apache projects so maybe we need to work outside of our own community, engaging with others as well.

Why would a user explore metadata using Apache MetaModel and not connect to various data stores directly?


If you only need to connect to one data store, which already has a query engine and all - then you don't have to use MetaModel. A key strength of MetaModel is uniform access to multiple data stores, and similarly a weakness is in utilizing all the functionality of a single source. We do have great metrics overall, but if you're trying to optimize the use of just one source then you can typically get better results by going directly to it.

Another reason might be to get a query API for things such as CSV files, Spreadsheets and so on, which normally have no query capabilities. MetaModel will provide you with a convenient shortcut there.

Also testability is a strong point of MetaModel. You may write code to interact with your database, but simply test it using a POJO data store (an in-memory Java collection structure) which is fast and light-weight.

It seems Apache MetaModel does not support HDFS though it supports HBase. Any specific reason for that?


Not really, except it wasn’t requested by the community yet. But we do have interfaces that quite easily let you use e.g. the CsvDataContext on a resource (file) in HDFS. In fact a colleague of mine did that already for a separate project where a MetaModel-based application was applied to Hadoop.

My guess as to why this is less interesting is that if you're an HDFS user then you typically have such an amount of data that you anyway don't want to use a querying framework (like MetaModel) but rather want a processing framework (such as MapReduce) in order to handle it.

Does Apache MetaModel support polyglot operations with multiple data stores?


Yes, a common thing that our users ask is stuff like “I have a CSV file which contains keys that are also represented in my database – can I join them?” … And with MetaModel you absolutely can. We offer a class called CompositeDataContext which basically lets you query (and explore for that matter) multiple data stores as if they were one.
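A hedged sketch of that approach follows; the file name, JDBC connection and table/column names are placeholders, and the example only shows the composite wiring rather than MetaModel's full join surface.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;

// Sketch: a CSV file and a relational database exposed behind one composite
// DataContext, so queries (including joins across the two stores) are
// expressed exactly as against a single store.
public class CompositeQueryExample {
  public static void main(String[] args) throws Exception {
    DataContext csv = DataContextFactory.createCsvDataContext(new File("keys.csv"));
    Connection connection = DriverManager.getConnection("jdbc:h2:mem:example");
    DataContext database = DataContextFactory.createJdbcDataContext(connection);

    // One DataContext spanning both stores.
    DataContext composite = DataContextFactory.createCompositeDataContext(csv, database);

    // Query a table from the database side through the composite context.
    try (DataSet dataSet = composite.query()
        .from("customers")
        .select("id", "name")
        .execute()) {
      while (dataSet.next()) {
        System.out.println(dataSet.getRow());
      }
    }
  }
}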



Kasper Sorensen is the VP of Apache MetaModel and in his daily life he works as Principal Tech Lead at Human Inference, a Neopost company. The main products in his portfolio are Apache MetaModel and the open source Data Quality toolkit DataCleaner.

Governance in a data lake

The need for defining a robust data governance layer is becoming an essential requirement for an enterprise data lake. Continuing our discussion on data governance, we focus on Apache Falcon as a solution option for governing the data pipelines. HadoopSphere discussed with Srikanth Sundarrajan, VP of Apache Falcon, about the product as well as the data governance requirements. In the first part of the interview, we talked about Falcon's architecture. We further discuss the functional aspects in the interaction below. 

What lies ahead on the roadmap of Apache Falcon for 2015?

Major focus areas for Apache Falcon in 2015 and beyond:
Entity management and Instance administration dashboard – Currently CLI based administration is very limiting and the real power of the dependency information available within Falcon can’t be unlocked without an appropriate visual interface. Also entity management complexities can be cut down through a friendlier UI.
Recipes – Today Falcon supports the notion of a process to perform some action over data. But there are standard and routine operations that may be applicable to a wide range of users. The Falcon project is currently working on enabling this through the notion of a recipe. This will enable users to convert their standard data routines into templates for reuse and, more importantly, common templates can be shared across users/organizations.
Life cycle – Falcon supports standard data management functions off the shelf, however the same doesn’t cater to every user’s requirement and might require customization. Falcon team is currently working on opening this up and allowing this to be customized per deployment to cater to specific needs of a user.
Operational simplification – When Falcon becomes the de-facto platform (as is the case with some of the users), the richness of dependency information contained can be leveraged to operationally simplify how data processing is managed. Today handling infrastructure outage/maintenance or degradation or application failures can stall large pipelines causing cascading issues. Dependency information in Falcon can be used to seamlessly recover from these without any manual intervention.
Pipeline designer – This is a forward-looking capability in Falcon that enables big data ETL pipelines to be authored visually. This would generate code in language such as Apache Pig and wrap them in appropriate Falcon process and define appropriate feeds.

Can you elaborate on key desired components of big data governance regardless of tool capabilities at this stage?

Security, Quality, Provenance and Privacy are fundamental when it comes to data governance:
Quality – Quality of data is one of the most critical components, and there have to be convenient ways both to audit the system for data quality and to build proactive mechanisms to cut out any sources of inaccuracy.
Provenance – Organizations typically have complex data flows, and oftentimes it is challenging to figure out the lineage of this data. Being able to get this lineage at a dataset level, field level and record level (in that order of importance) is very important.
Security – This is fundamental and hygiene to any data system. Authentication, Authorization and Audit trail are non-negotiable. Every user has to be authenticated and all access to data is to be authorized and audited.
Privacy – Data anonymization is one of the key techniques to conform to laws and regulation of the land. This is something that the data systems have to natively support or enable.

Why would an enterprise not prefer to use commercial tools (like Informatica) and rather use open source Apache Falcon?

Apache Falcon is a Hadoop first data management system and integrates well with standard components in the big data open source eco systems that are widely adopted. This native integration with Hadoop is what makes it a tool of choice. Apache Falcon being available under liberal APL 2.0 license and housed under ASF allows users/organizations to experiment with it easily and also enable them to contribute their extensions. Recent elevation of Apache Falcon to a top-level project also assures the users about the community driven development process adopted within the Falcon project.

If someone is using Cloudera distribution, what are the options for him?

Apache Falcon is distribution agnostic and should work (with some minor tweaks) for anyone using Apache Hadoop 2.5.0 and above along with Oozie 4.1.0.  There are plenty of users who use Apache Falcon along with HDP. One of the largest users of Apache Falcon has used it along with CDH 3 and CDH 4, and there are some users who have tried using Apache Falcon with MapR distribution as well.


Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella including Apache Lens, Apache Hadoop-core, and Apache Oozie. He has been working with distributed processing systems for over a decade and Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from University of Southern California.



Data pipelines with Apache Falcon

Over the past few months, Apache Falcon has been gaining traction as a data governance engine for defining, scheduling, and monitoring data management policies. Apache Falcon is a feed processing and feed management system aimed at making it easier for Hadoop administrators to define their data pipelines and auto-generate workflows in Apache Oozie.

HadoopSphere interacted with Srikanth Sundarrajan, VP of Apache Falcon, to understand the usage and functional intricacies of the product. This is what Srikanth told us:

What is the objective of Apache Falcon and what use cases does it fit in?


Apache Falcon is a tool focused on simplifying data and pipeline management for large-scale data, particularly data stored and processed through Apache Hadoop. The Falcon system provides standard data life cycle management functions such as data replication, eviction and archival, while also providing strong orchestration capabilities for pipelines. Falcon maintains dependency information between data elements and processing elements. This dependency information can be used to provide data provenance besides simplifying pipeline management.

Today Apache Falcon system is being used widely for addressing the following use cases:
Declaring and managing data retention policy
Data mirroring, replication, DR backup 
Data pipeline orchestration
Tracking data lineage & provenance
ETL for Hadoop

The original version of Apache Falcon was built back in early 2012. InMobi Technologies, one of the largest users of the system, has been using it as a de-facto platform for managing their ETL pipelines for reporting & analytics, model training, enforcing data retention policies, data archival and large-scale data movement across WAN. Another interesting application of Falcon is running identical pipelines in multiple sites on local data, with the results merged in a single location.

How does Apache Falcon work? Can you describe to us major components and their function?


Apache Falcon has three basic entities that it operates with:
- firstly the cluster, which represents physical infrastructure,
- secondly the feed, which represents a logical data element with periodicity, and
- thirdly the process, which represents a logical processing element that may depend on one or more data elements and may produce one or more data elements.

At its core, Apache Falcon maintains two graphs: (1) the entity graph and (2) the instance graph. The entity graph captures the dependencies between the entities and is maintained in an in-memory structure. The instance graph, on the other hand, maintains information about instances that have been processed and their dependencies. This is stored in a Blueprints-compatible graph database.
 
Falcon Embedded Mode

The Falcon system integrates with Apache Oozie for all its orchestration requirements, be it a data life-cycle function or process execution. While control remains with Oozie for all the workflows and their execution, Falcon injects pre- and post-processing hooks into the flows, allowing Falcon to learn about the execution and completion status of each workflow executed by Oozie. The post-processing hook essentially sends a notification via JMS to the Falcon server, which then uses this control signal to support other features such as re-runs and CDC (change data capture).

Falcon has two modes of deployment: embedded and distributed. In embedded mode the Falcon server is complete by itself and provides all the capabilities. In distributed mode the Prism, a lightweight proxy, takes center stage and provides a veneer over multiple Falcon instances, which may run in different geographies. The Prism ensures that the Falcon entities are in sync across the Falcon servers and provides a global view of everything happening on the system.

Falcon Distributed Mode

Falcon system has a REST based interface and a CLI over it. It integrates with Kerberos for providing authentication and minimal authorization capabilities.



Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella including Apache Lens, Apache Hadoop-core, and Apache Oozie. He has been working with distributed processing systems for over a decade and Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from University of Southern California.


