Recent developments...

Deep dive into Actian Vortex architecture

The innovations continue at a rapid pace in SQL on Hadoop solutions with each vendor trying to outsmart the competition. In this second part of interview with Actian’s Emma McGrattan, we try to understand architecture of Actian Vortex’s SQL in Hadoop offering with particular focus on database/SQL layer named Vector. Emma is the Senior Vice President for Engineering at Actian and described the "Marchitecture" (as she likes to term it) in a conversation with Sachin Ghai. As per Emma, Actian Vortex product suite is among the fastest and most mature SQL 'in' Hadoop offering. 

Actian Engineering has definitely put a lot of thought and innovation in the Vortex architecture. It is one of those products where the engineering team exactly knew the nuts and bolts of Hadoop as well as the cranks and shafts of database. It is rare currently to find an SQL offering which relies on HDFS as storage but still achieves enterprise grade resonant with the database category. Utilizing YARN more as an enabler and replication as an added blessing, Actian Vortex has made it’s Vector database off-Hadoop to seamlessly transition to Hadoop. With more production implementations in future months, we can expect hardware disk requirements to get further optimized and in-memory features to take center stage. 
Read more to understand its technical and functional architecture.

You mentioned about HDFS being one of bottlenecks for performance but what about YARN. Is YARN a bottleneck or an enabler for SQL on Hadoop solutions?

It's funny because it's both an enabler and can actually be a real challenge. Where it becomes a real challenge is the fact that YARN is not designed for long running tasks - it's designed for MapReduce jobs which are typically very short in duration. Typically for a database you expect to start the database and it should just keep running until you need to stop it. It's a long running task in that one would expect for a full quarter without ever coming down for maintenance and that's a challenge for YARN. We do some things within our products so that when we're using YARN, we can allow for the fact database is a long running task and be able to say I want to allocate more or less resources to the database. May be you are doing month end or quarter end processing and you want to give the database some additional cores from environment. You want to add in additional nodes that will speed up processing and then you want to remove those additional resources that you granted it when the quarter or the month is over. That's a challenge for YARN - so we needed to overcome that challenge because what we see is traditionally when you're dealing with EDW, you have periods of huge activity. What you want to do then is to give database as much resources you can spare and then take back afterwards. 

We had to figure out how to do that with YARN and then the other thing that we need to figure out is how we drop priority of the database. It may be you've got the ETL task that you want to give priority to - you may want ETL processing to get the most of the resources but once the ETL processing is done, you want the database to get to full strength. That was another thing that we figured out with YARN.  We were able to use YARN to get dynamic resource scheduling in terms of adding more resources and taking them away in a dynamic or elastic fashion. Typically when you're dealing with database, you are given a certain amount of memory that you're going to allow it to use over its lifetime and very rarely you want to shrink it. We were able to use YARN to figure out how to do that.

We also got the benefit of using YARN for some things like intelligent block replica placement. By default YARN will replicate the blocks that you write to HDFS and that's fantastic for database because you have that built-in failover capability. If you lose a disk there's another copy of the data after that you can cover from. But we were able to do was to tell YARN when we're writing block replicas that we want to tell where they are placed. If you're joining two tables and the tables are partitioned in such a way that the rows that you're joining aren't going to be co-located, we tell YARN we wanted to locate one of those replicas with the table that we are going to be joining it to. That way we were able to avoid any internal communication when doing complex query joins. YARN was hugely beneficial to do something like that. So yeah it's a double edged sword - it presents challenges for enterprise data warehouse but it also provides some capabilities that would be difficult if not impossible to implement and get those for free.

Taking example of join operation, can you give us insights into Vortex architecture?

As part of the platform we have built, we also have a visual workbench to figure out the data flow to take data into the Hadoop environment. It also gives a lot of capabilities around analytics. It is a simple point and click workbench that you drag some nodes on the screen and can build rich data science applications. The box number 1 on the figure is the visual workbench. The first thing we are doing here is reading data from Hadoop, maybe it's log file or smart meter data or something like that. If we follow this read it connects to our high-performance ingest, prep and load that's running on the name node in the cluster. That parallelizes read of source data - so we have these 4 green cylinders on the data nodes and that's just a data file that we're reading using data flow - read all parts in parallel and read that up into the name node in the cluster. Maybe at this point we may want to do some analytics and data science. We may want to do some ingest, prep and load as well and that's the scenario we're going to talk now because that's where our SQL on Hadoop capability comes in again. 

So we've read the data in parallel using our data flow product and now decide that we want to load data in SQL in Hadoop (Vector) technology. We follow the number 2 we're connected to the database master node. We have a master worker architecture and the master node is responsible for preparing the query execution plan and then dividing up the workload to all of the data nodes that are participating in solving the query. The database that's running on the master node generates the query execution plan and if you follow the blue line on the diagram that will then connect to what we call the X100 server. We call it X 100 because we believe it to be a hundred times faster than anything else out there and that's been heart of SQL on Hadoop solution that runs on data nodes. This X100 server will write data to disk so we look at the Blue cylinders here and we are right to blue blocks on disk. We can run that X 100 server on a subset of the nodes and when you install the product you decide which data nodes are going to be participating in your workers set. I mentioned already that you can grow and shrink this depending upon your business requirement. 

In this case we're writing with data out to 4 nodes and so we have got here 4 blue cylinders that are used to illustrate 4 partitions we are writing to. You will see written underneath here then is the block replica and these grey cylinders represent the fact that these four blocks are going to be replicated three times. For simplicity sake we want to join all of the data that is in Data Vector A with the Data Vector B. Rather than having to pull all of the data from Data Vector B over to node where the Data Vector A lives, we actually have a replica of that block that is local to us. That replica of B is right there and we are able to join with the data in there. HDFS has already guaranteed that's an exact replica of the data that is held in Vector B and we are able to join with and remove the need to communicate between the nodes. 

We are talking of big data and each of these tables could represent billions or hundreds of billions of rows of data. We want to avoid having to move data around the cluster if at all possible. Using the block replicas to perform these joins is something that really yields great performance results. When we showed you earlier that slide where we compared performance with Impala, that's one of the reasons. If you look again to slide for queries that are over 30 times faster than Impala, and Impala spends a lot of time data moving data around between nodes. We can out-perform them because of other architectural benefits that we have but not having to move data around for complex joins is something that is bringing benefits in complex queries. 

When you create the tables and you define your primary key foreign key relationships between the table, we use that as a hint as to partitioning the data and building up the location of the block replicas - for figuring out where we want to locate those replica so that we can benefit from a performance perspective. And the other thing that influences placement for the block replica is making sure that should we lose a node in the environment, our performance doesn't drop off the cliff. The loss of one node will impact performance because you're dealing with 3 nodes instead of 4 and you expect performance to be impacted in that scenario by 25 percent. You don't want all joins that you do subsequently to be remote joins and that comes in play also to determine how you place replicas. That's how we perform the joins.

If we look to architecture diagram and that 4th number that's there in Black is the standard SQL applications on BI tools that can just connect directly to the master node and you don't need to make any changes to the those applications. Without changing a line of code we can provide a scalable solution. We have customers that were previously on our Vector solution and they've moved now to Vector in Hadoop and they are seeing better than linear scalability. For our customers not having to make any changes to application and get a linear scalability is very exciting. 

Is metadata for Vortex available to any other agents outside Vortex?

Today, you can access metadata through JDBC/ODBC and some of the APIs but it is not available as part of HCatalog. In the release which is coming up in the first quarter of 2016, we will actually be using HCatalog for storing our metadata. That will be externally accessible by more than just the standard database API. As part of that release, we will also be adding external table support and that will enable you to take any other data source and just register it as a table within the database. We will be able to join data that's held in the Actian SQL in Hadoop solution with any other data in the environment. Let's say you want to join our data with parquet data, you would need to actually load the parquet data in the environment using the Data Flow solution. But in the release that we have coming out in the first quarter of next year you can leave that data in Parquet. We prefer that you import the data into our solution because performance is a lot greater but what we do recognize a lot of people have standardized on parquet for data storage. (So we will provide) the ability to register those parquet file as tables within the database and then just treat them as Vector table. 

How does Vortex support window functions and subqueries?

We have full sub querying and window function support - we have cube, roll-up, grouping set - all of the advanced analytics capabilities that are called out in the SQL language spec and have been incorporated within the product. They have been there for a number of years.
The Vector on Hadoop technology began life as a product Vector off Hadoop almost a decade ago and has been around for some time. It is an analytics engine but you know people take off the shelf tools like Cognos, MicroStrategy, Informatica, Tableau and so on and run it against the database. They have requirements like window framing, multiple window functions, grouping sets and so on - all of those are built into the product. The SQL language committee does not call out a specific analytic language set per se.  So it is not like there's you can say "Oh! look, I'm SQL 2015 analytic compliant" because it's not the way the language is defined. But we do have all the analytics capabilities that are described by both the ANSI and the ISO SQL committees.

Is Vortex truly open source or is it just the API that is open?

Vortex is not open source but we are in the process of publishing a set of libraries that will enable other products to access data that's been written to disk by Vector. They don't need to go through the Vector engine to access the data. Projects that could benefit from storing data in Vector format can use that but also products that want to read data directly without having to go through Vector can do so. This is something that was raised by a couple of our early prospects where they didn't want to be tied into a proprietary solution on Hadoop. We did not want to open source Vector on Hadoop because what we do is quite unique. There's a lot of IP in there that we don't want to disclose at this point in time. We do want people to be able to store data in our format and read and write data to our file format. We're publishing a set of API that will enable people to do that without buying or using the product. 

Read more »

SQL on Hadoop landscape with Actian's Emma McGrattan

The SQL on Hadoop capability is turning out to be a real game changer for data warehouse and Hadoop vendors. To get a sense of what is happening in this intriguing space, Sachin Ghai talked to Emma McGrattan about SQL on Hadoop solutions. Emma is the Senior Vice President for Engineering at Actian and has the responsibility for analytics platform. In this 2 part interview published on HadoopSphere, Sachin asked Emma questions about the generic SQL on Hadoop ecosystem and got further technical insights into market space and technical architecture. Read more to understand about SQL on Hadoop solutions (or SQL 'in' Hadoop as Actian likes to term it).

What is the state of Hadoop ecosystem at this time? Where does it seem to be heading particularly with regards to SQL on Hadoop solutions?

If I could point you to SQL in Hadoop landscape (graphical image), that is a good slide to use to explain where we see landscape today and what we see as the future plans for some of the competitors in the space and how we position ourselves here. First off, if you look at the slide at the axis we use, the Y axis here is the product maturity. People have expectations when they are running SQL solutions and these are expectations that they've built up over the past decades because of experience with traditional enterprise data warehouse technologies and traditional relational database technologies. And so when you're dealing with the SQL language, there is an assumption that there's going to be the full language support and you're going to have the ability to run transactions.

SQL in Hadoop landscape as per Actian

Let's say take a very simple transaction such as transferring $100 for my checking account to my savings account I don't want to debit the checking account unless I know the savings account can be credited by that $100. So the idea of a transaction is something that's fundamental to people using databases and the ability to roll those back. Let's say I have made a mistake and I can't actually update my savings account, I will be able to roll back that transaction. These are ideas that people have that a database should provide and the SQL in Hadoop players because of their lack of maturity they don't have all of these features. 
On the maturity axis here we're looking at things like the language support capabilities and the security features that they provide their performance. So we look at this first circle that we've got on the graph here. We've got the traditional enterprise data warehouse players - so this is Teradata, Oracle, SQL Server 2012 and its Vertica. What these guys are doing they're trying to protect the revenue that they get from traditional enterprise data warehouses. These are public companies and have shareholders whose interests need to be protected. So they need to protect their revenue streams and what these guys are typically doing is they're continuing to push the enterprise data warehouse for the analytics needs. What they're doing is recognizing the fact that people are storing new data types and new data sources in Hadoop. So they provide connectors through to the Hadoop environment and which is why in the red title for this box, I have called them “connections”. What they do is they connect to data in Hadoop and they pull that data from Hadoop into the traditional enterprise data warehouse and they allow you to operate on it. So these guys are really not in Hadoop and they provide connections through to Hadoop data source and I believe that they provide a lot of the capabilities that one will expect in enterprise data warehouse. 

The next circle on here is “wrapped legacy” circle. These companies have taken a traditional relational database technology from the open source world - typically it's a Postgres although in the case of Splice Machine it's Apache Derby. What they've done is they're using the SQL capability of these and all of the capabilities that these legacy database solutions provide and then they are converting SQL statements in MapReduce jobs or finding some way of executing those queries on data resident in Hadoop. In terms of their maturity, they have a lot of capabilities because they've taken mature database technologies. But some of the capabilities that one would expect in terms of performance and scalability - they don't have that because under the underlying capabilities that they take from Hadoop such as MapReduce, just don't perform. They inherit capabilities that lack performance from underlying Hadoop capabilities. 

And third circle in here this green circle table what we label as “from scratch” players and this is the likes of Cloudera Impala, Hortonworks Hive or the IBM Big Insights. What these guys have done is they have built a solution from scratch specifically for Hadoop. They're not cannibalizing any other technologies. They started out with a clean sheet and they're looking to build capabilities that are native to Hadoop. So in terms of the X-axis here, they are very well integrated with Hadoop. But because they're being built from scratch, they're lacking all of the material that one expects from SQL on Hadoop solutions. Security is something that they're kind of bolting on up to the fact; performances is very poor and then they are missing some capabilities like the ability to perform updates. So if you want to have an individual record within a dataset updated these guys have really struggled how best to do that. For those solutions that have figured out how to do updates, they sacrifice performance on the rest of the queries - so that's something that you really lacking in terms of database maturity. But they're very tightly integrated with Hadoop. 

What we see as unique about what we do in our own separate circle on the graph is that we've got all of the capabilities that one expects in the traditional enterprise data warehouse. We've got the performance that one expects a traditional enterprise data warehouse and in fact we can outperform the traditional enterprise data warehouse even though we are on Hadoop. We're also very tightly integrated with Hadoop; we use HDFS for file storage and we use YARN for scheduling and failover capabilities and so on. We're very integrated with the Hadoop system but we are able to provide the capabilities that one often has to sacrifice when looking at a SQL on Hadoop solution. Specifically, security is something that a lot of these SQL on Hadoop solutions are lacking and specifically around the ability to update individual records - the capability that's known in the database word as 'trickle updates'. Bulk update is something that everybody can do but the ability to update individual records something that's difficult in Hadoop. The reason it's difficult is that the HDFS file system was designed so that you write data to the file system once and you only ever read it after that. HDFS as a file system wasn't designed for people to be frequently updating the records in there. It just wasn't the design point that they started with and that's why Cloudera is now looking at Kudu. So that they're looking at a completely different implementation because they just can't get HDFS to provide the capabilities that they need. 

Now we at Actian have figured out how to do this and we have a patent pending on the capability to do it. Essentially we keep track of changes and we keep track of the positions of data that's changed in the environment in the memory overlay structure. When you reference data that has been updated, we reference this overlay structure and we are able to figure out if the data that you need has changed. As you query the data, you will get the updated value of the result that is passed back to you. Now I can’t go into too much detail as to exactly how we've done this because I say a patent is pending and we do believe when we look at what Cloudera is doing in Kudu, they have borrowed a number of the ideas - because up until now we have been talking about this publicly and the Cloudera guys have learned a lot from development work and research work that we have done at Actian. 

And that's how we see the landscape - provide all of the capabilities on the maturity that one expects from the traditional enterprise data warehouse. We are also very tightly integrated with Hadoop using all of the Hadoop projects that can benefit an enterprise data warehouse like the file system and like YARN resource negotiation.

One of the key trends that is emerging is that most of the SQL on Hadoop solutions are beginning to support ACID transactions and in fact Hive has started supporting ACID transactions in 0.14 release. But most of these solutions are still of OLAP nature. Do you see these solutions moving towards OLTP support or is that still a pipe dream?

Splice machine today supports OLTP and that's their target right now. They don't do analytics so much as they do OLTP and when they talk about their performance they used the TPCC benchmark which is a well-known OLTP benchmark to demonstrate their capabilities. I think the challenge that all of the players have in supporting the OLTP workload is the fact that the underlying filesystem just was not designed for it. There's a couple of things that get in the way: obviously if it's not designed for supporting random updates to the data and then taking a very heavy update intensive workload to file system is going to be a nightmare for disaster. That requires capabilities like we've had to implement whereby you actually keep track of all of the changes in memory and you have a Write Ahead Log file because you need that consistency that ACID guarantees. Should you lose power in your data center and the cluster disappears you want to make sure that  any of the transactions that were committed at that point in time can be recovered during the cold start recovery process. I think projects like Kudu will help but I think there's a fundamental problem with the underlying distributed file system in Hadoop that is causing a real challenge for bringing OLTP workloads to Hadoop.

Another key question which most customers come up is that how difficult or easy is it for them to migrate the legacy SQL and ETL workloads they were running in the traditional warehouse to the big data warehouse. What has been your experience in getting those SQL workloads migrated to the big data clusters?

When I talk about the guys that are lower on the SQL maturity scale, two things that they're missing. One is the full SQL language support so they will have a very limited subset of SQL language implemented and that is a problem when taking existing workloads to Hadoop. The reason for that is that traditional workloads most times use off-the-shelf tools for generating reports and dashboards and so on. You don't have control over exactly what syntax is going to be produced by those tools - so that you cannot then work around deficiencies in their SQL implementations. That's the first challenge they face and they're trying to round out their SQL language support. The 'from scratch' and 'the wrapped legacy' players don't have a complete language implementation. For the 'from scratch' guys, it's just because they haven't had the time to complete it yet and for 'the wrapped legacy' guys it's because the underlying capabilities that they're depending upon don't provide everything. So you know it may be that they're taking SQL and then converting to MapReduce jobs - so they don't get the full capabilities that one needs for supporting these traditional workloads. 

The second challenge that these 'from scratch' players is that their query optimizers are very immature. If you look at Hive or Cloudera and look at how they optimize queries. They don't have the smarts to figure out what's the best order in which the tables were you referencing multiple tables within a query. To work around that limitation, what they do is that they require that you list the tables in the order which you want them joined. As you write the SQL as an application developer, you need to know which is going to be the most efficient join order of the tables and you need to code that in query. As I said earlier, traditional workloads used tools like Business Objects, Tableau and any visualization technology. You don't have control over SQL that they're generating and so you don't have control over things like join orders. That's a real challenge and when we look at the 'from scratch' players and the 'wrapped legacy' players they really struggle to take traditional enterprise data warehouse workloads to Hadoop using those technologies. 

With the Actian solution, we have the complete SQL language implementation and we also have invested many decades in optimization technologies. What we were able to do is to take our Cost Based Query Optimizer to Hadoop and modify it for some of the unique scenario that one encounters in Hadoop. We can outperform the competition on Hadoop by quite a margin if you look (at the performance benchmark diagram). You'll see here is that when we compare our performance with that of Impala and if you read this graph correctly it's a little confusing. What we did was we took a workload that the Impala team used to demonstrate their superiority over Hive. They ran these queries and said that they were 100 times faster over Hive. What we did was we ran them against the Actian SQL in Hadoop and we show for instance for first query, we are 30 times faster than Impala.

We can outperform these guys by quite a margin because we know how to solve complex queries and we also can take a standard EDW workload and deploy it on Hadoop without having to change anything. We have some very large customers and that were previously off Hadoop with our Vector technology. We moved them to Vector on Hadoop and without changing anything except the connection string where they pointed to the database they were able to run their enterprise data warehouse workload against SQL in Hadoop solution. Being able to do that for a customer is very exciting, it's very empowering and for us it means certifying the latest versions of Cognos or Informatica or whatever it may be.  We certify them as part of our standard QA process and we know that any customer that's using any official tool for EDW today can move their workload to our SQL in Hadoop solution without having to compromise performance or functionality or security.

That's pretty interesting and these benchmarks are often talking points since they hide more than they show.

That is a good thing that you raised...You'll see that the SQL that we use in Vector is a standard SQL. These benchmarks typically what they are measuring is the performance at the database engine - so they usually will restrict to 100 rows that come back from the query. Those top 100 rows from three tables that we're going to join and then we've got some join conditions here and some group by and order by. If you look at... the Impala equivalent in their benchmark capability, what they actually do is they specify the partition keys that are used for solving this query. They have completely cheated in this benchmark. They know exactly where on disk the rows that are gonna satisfy the query lie and they point exactly to those rows. But even with them playing that trick, we will outperform them by thirty times. 

The benchmarks are not something that I would trust unless they came from a party like TPC and they are audited benchmarks. In this particular case that we call as the Impala subset of TPC-DS they identified the queries where they were going to out-perform Hive, they rewrote them because they have limited SQL support and then they added the partition keys. So it's a game - they cheated and I call them on their cheating when I use this presentation publicly because I think it's wrong. 

Everybody is playing games with these benchmarks and we can use them to demonstrate whatever you chose. But for us we're using standard SQL. We have clean hands in this and we also plan something for the first half of next year to publish some audited benchmark results. We're not running the TPC-DS because that benchmark is under review with TPC Council right now and it looks like they're going to deprecate it replacing it with something for Hadoop. We are going to take more traditional workload TPC-H benchmark and we are going to publish numbers in first half of next year and these will be audited benchmark numbers for TPC-H benchmark on Hadoop. We will be the first database vendor to publish audited TPC benchmarks for Hadoop. We are very excited by this because our performance is going to be, I hope, an order of magnitude greater than traditional enterprise data warehouse players. We expect to beat Teradata, Oracle, SQL Server and Vertica by an order of magnitude which is huge when you are dealing with Hadoop which is known as commodity workhorse platform but is not known for performance.

Read more »

Apache Spark turning into API haven

Apache Spark is the hottest technology around in big data. It has the most generous contributions from the open source community. But with the latest release of Apache Spark 1.6, there is clearly a pattern evolving in where it is heading. And, currently that road seems to be an API haven. (Note, the term being used here is haven and not heaven).

Spark innovations in 2015

March 2015: Spark 1.3 released.

DataFrames introduced: SchemaRDD renamed and further innovated to give rise to DataFrames. DataFrames are not just RDDs with schema but have a huge army of useful operations that could be invoked with an exhaustive API.
For some strange reason, DataFrames were decided to be more of relational nature only and so were pitched directly along with Spark SQL. A developer could use either DataFrame API or could use SQL to query relational form data which could be residing in tables or any Spark supported data source (like Parquet, JSON etc.).

Nov 2015: Spark 1.6 released.

Datasets introduced: specialized DataFrames which can operate on specific JVM object type and not just rows. Essentially Dataset uses a logical plan created by Catalyst, the algorithmic engine behind Spark SQL/DataFrame and thus can do a lot of logical operations like sorting and shuffling. In a future release, you can expect DataFrames API to change and extend Dataset. 
Nutshell, Datasets are more intelligent objects compared to vanilla RDDs and if you might have guessed it, could be the future pinning of Spark API. 
Is it time to say "bye bye RDD, welcome Dataset"?

Happy or Upset with new Spark release?

Are you sounding relieved or almost upset with new Spark release– could depend on what stage are you in Spark journey.
For an indicative sample, answer could depend on what you doing.

You could be upset if:
- You have already been using Spark API and more or less re-settled yourself on Spark 1.3.x being distributed with CDH, HDP and MapR.
- You were wishing for a more powerful and SQL-rich Spark SQL but instead see focus shifting from SQL to programming API.
- You were wishing for a tighter integration between SQL and ML rather than DataFrames API and ML Pipelines API.

You could be delighted if:
- You are a devout Cloudera Impala or Hive fan and loved their performance! You always wanted to stick with these rather than Spark SQL offering more performant in-memory analytical power.
- You were a Storm and Mahout fan and were looking for ways to avoid shifting to Spark! You now know that coding in Spark may not be that cake-walk as it was promised to be since there are frequent API changes and over-reliance on experimental API for exposing functions in the promised unified platform.
- You were building your stealth next top big data product and were suddenly delighted to figure out that it’s difficult for users to remain settled in competing Spark! What was relevant in Spark around 6 months back may require re-factoring/re-engineering now.

Overall, the request to the brilliant and brave Spark core committers is to have a rethink and start loving the world outside programming API. Spark has been given the official throne of big data execution engine. So, it now needs to settle out on the surface while it keeps to continuously innovate under the covers. Spark is no longer ‘experimental’ and the term needs to move out of its documentation and strategy. It has to be hardened, enterprise grade and long serving to the end user. That's the only way for Spark to keep shining.

Read more »

Large scale graph processing with Apache Hama

Recently Apache Hama team released official 0.7.0 version. According to the release announcement, there were big improvements in Graph package. In this article, we provide an overview of the newly improved Graph package of Apache Hama, and the benchmark results that performed by cloud platform team at Samsung Electronics.

Large scale datasets are being increasingly used in many fields. Graph algorithms are becoming important for analyzing big data. Data scientists are able to predict the behavior of the customer, the trends of the market, and make a decision by analyzing the graph structure and characteristics. Currently there are a variety of open source graph analytic frameworks, such as Google’s Pregel[1], Apache Giraph[2], GraphLab[3] and GraphX[4]. These frameworks are aimed at computations varying from classical graph traversal algorithms to graph statistics calculations such as triangle counting to complex machine learning algorithms. However these frameworks have been developed each offering a solution with different programming models and targeted at different users. In this article, we introduce the Apache Hama[5] and its graph package.

What is Apache Hama?

Apache Hama is a general-purpose Bulk Synchronous Parallel (BSP)[6] computing engine on top of Hadoop. It provides a parallel processing framework for massive scientific and iterative algorithms. BSP is an easy and flexible programming model, as compared with traditional models of Message Passing, as shown in Figure 1.

Figure 1. Bulk Synchronos Parallel (BSP) Model.

Hama performs a series of supersteps based on BSP. A superstep consist of three stages: local computation, message communication, and barrier synchronization. Hama is suitable for iterative computation since it is possible that input data which can be saved in memory is able to transfer between supersteps. However, MapReduce must scan input data in each iteration, and then output data must be saved in file system, such as HDFS. Hence, Hama can solve the problems which MapReduce cannot handle easily.

Graph Package of Apache Hama

Apache Hama also supports a graph package which allows users to program applications for graph-parallel computations[7]. The vertex-centric model is suggestive of MapReduce[8] in that users focus on a local action, processing each item independently, and the system compose these actions to lift computation to a large graph dataset. It is easy to implement and prove to be useful for many graph algorithms.

Figure 2. Finding the maximum value in a graph example. 
Dotted lines are messages. Grey vertices have voted to halt.

Figure 2 illustrates the concept of graph processing in Hama using a simple example: given a connected graph where each vertex contains a value, it propagates the latest value to every vertex. Any vertex that has known a larger value from its messages sends it to all its neighbors in each superstep. When no more vertices change in a superstep, the algorithm terminates. If you want to know graph package in Hama, please refer to Apache Hama Programming[7].

Hama in Action

We performed the performance of the version of 0.7.0 of Hama, which is recently release, on Amazon Web Service Elastic Map Reduce(EMR) instances. PageRank[9] algorithm is used for benchmark test of the graph package. PageRank is an algorithm that is used to rank web pages according to their popularity. This algorithm calculates the probability that a random walk would end in a particular vertex of a graph. This application computes the page rank of every vertex in a directed graph iteratively. At every iteration t, each vertex computes the following:
where r is the probability of the a random jump, E is the set of directed edge in the graph and PRt(j) denotes the page rank of the vertex at iteration t.
We compared the performance of Hama with Giraph which is a graph framework that is used in Facebook. PageRank algorithm was already implemented in both of the frameworks. Default input types in Hama is text type and in Giraph is JSON. We implement input format of JSON to run in Hama for a fair comparison. JSON type format is as follows:

Above datasets are generated using fastgen example program which generates random graph datasets. The way fastgen generate the datasets on Hama cluster is as follows:

% bin/hama jar hama-examples-x.x.x.jar gen fastgen -v 1000 -e 100 -of json -o /randomgraph -t 40

  The following is the meaning of the options:

Experimental Results

In this section, we describes our experimental setup with details about datasets and experimental platform. We conducted various experiments with PageRank algorithm on EMR cluster. We used graph datasets randomly generated using fastgen example in Hama. Also we run benchmarks on EMR cluster which consists of instance type of r3.xlarge[10]. This instance type is memory-optimized instance type which has 30GB of RAM, and 4 vCPU. The cluster run on the hadoop 1.0.3(Amazon Machine Image version 2.4.11). That is because Giraph didn’t work well on hadoop 2.

Hama provides pure BSP model via own cluster. Furthermore, it works on both Hadoop YARN[11] and Apache Mesos[12]. Although Giraph has same programming model like Hama, it works as a MapReduce job which means that it requires MapReduce framework. So Hama performed PageRank algorithm making its own cluster. On the other hands, Giraph perform MapReduce application on hadoop.

For benchmarking of scalability, we increased the number of machines from 5 to 30 on a fixed dataset which has one billion edges (Figure 4). Both frameworks present significant scalability for same datasets. However, Hama has much lower execution time than Giraph for the same data set.

Figure 3. The execution time on same data set, depending on the size of machines.

From the computing performance’s point of view, Figure 4 shows execution time on same nodes, depending on the size of dataset. Hama also shows more powerful performance than Giraph for the same machines.

The main reason that the result of benchmark shows different performance in spite of the same programming model is that we use the advanced PageRank algorithm which uses aggregators for detecting the convergence condition and the BSP framework’s efficient messaging system. Hama uses own outgoing/incoming message manager instead of Java's built-in queues. It stores messages in serialized form in a set of bundles (or a single bundle) to reduce the memory usage and RPC overhead. Also unsafe serialization is used to serialize Vertex and its message objects more quickly. Instead of sending each message individually, Hama packages the messages per vertex at once and sends a packaged message to their assigned destination nodes. With this Hama v0.7 achieved significant improvement in the performance of graph applications.

Figure 4. The execution time on same machine, depending on the size of dataset.

Conclusion and future work

In this article, we presented a graph package of Hama. We also performed the performance of graph package, with Hama. The performance compared with Giraph, in respect of computing and scalability. As a result, the performance and scalability are already satisfactory for graphs with billions of vertices.

However, there are also a lot of improvement to be done based on the current version. Efficient load balancing, spillable vertices storage and message serialization are challenging issues. We look forward to add these features and see our community growth.



[1] G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. Principles Of Distributed Computing (PODC), 2009.
[2] Apache Giraph v 1.1.0
[3] GraphLab
[4] Apache Spark’s GraphX
[5] Apache Hama v 0.7.0.
[6] Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
[7] Apache Hama BSP programming,
[8] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
[9] Page, Larry, et al. PageRank: Bringing order to the web. Vol. 72. Stanford Digital Libraries Working Paper, 1997.
[10] Amazon EC2 instances,
[11] Apache Hadoop YARN arhitecture,
[12] Apache Mesos,
[13] Satish, Nadathur, et al. "Navigating the maze of graph analytics frameworks using massive graph datasets." Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.
[14] General dynamics, High Performance Data Analytics,

About the author:

Minho Kim is a software engineer at Samsung Electronics in the Cloud Technology Lab. Minho is a committer of Apache Hama project, which is a general BSP computing engine. He has been developing and contributing to Apache Hama. Minho’s research interests include deep learning such as DNN, and CNN.

Pre-register now for HadoopSphere Virtual Conclave 
for a chance to get a free invite to 
best classes and sessions on 
Spark, Tajo, Flink and more...
Read more »

Low latency SQL querying on HBase

HBase has emerged as one of the most popular NoSQL database offering distributed, versioned, non-relational tables hosted on commodity hardware. However, with a large set of users coming from a relational SQL world, it made sense to bring the SQL back in this NoSQL. With Apache Phoenix, database professionals get a convenient way to query HBase through SQL in a fast and efficient manner. Continuing our discussion with James Taylor, the founder of Apache Phoenix, we focus on the functional aspects of Phoenix in this second part of interaction.

Although Apache Phoenix started off with distinct low latency advantage, have the other options like Hive/Impala (integrated with HBase) caught up in terms of performance?

No, these other tools such as Hive and Impala have not invested in improving performance against HBase data, so if anything, Phoenix's advantage has only gotten bigger as our performance improves. 
See this link for comparison of Apache Phoenix with Apache Hive and Cloudera Impala.

Apache Phoenix and Cloudera Impala comparison
(Query: select count(1) from table over 1M and 5M rows)

What lies ahead on roadmap for Apache Phoenix in 2015?

Our upcoming 4.4 release introduces a number of new features:  User Defined Functions, UNION ALL support, Spark integration, Query Server to support thin (and eventually non Java) clients, Pherf tool for testing at scale, MR-based index population, and support for HBase 1.0.

We are also actively working on transaction support by integrating with Tephra ( If all goes according to plan, we'll release this after our 4.4 release (in 4.5 or 5.0), as this work is pretty far along (check out our txn branch to play around with it).

In parallel with this, we're working on Apache Calcite integration to improve interop with the greater Hadoop ecosystem through plugging into a rich cost-based optimizer framework. IMHO, this is the answer to ubiquitous usage of Phoenix for HBase data across queries that get data from any other Calcite adapter source (RDBMS, Hive, Drill, Kylin, etc.). This will allow a kind of plug and play approach with this the push down being decided based on a common cost model that all these other tools plug into.

Come hear more and see a demo at our upcoming Meetups or at HBaseCon 2015. 

Does Apache Phoenix also talk to HCatalog or is that interaction left off to HBase itself?

Phoenix manages its metadata through a series of internal HBase tables. It has no interaction with HCatalog.

Can Apache Phoenix be connected with BI tools which have traditionally relied on ODBC drivers?

Phoenix can connect with BI tools that support a JDBC driver. However, BI tools that rely on an ODBC driver are more challenging. There's a new thin driver plus query server model that we support in our upcoming 4.4 release which will help, though. This thin driver will open the door for an ODBC driver to be achievable by writing the same protocol that our Java-based thin driver use (JSON over http).

Which commercial distributions is Apache Phoenix part of? 

Apache Phoenix is available in the Hortonworks HDP distribution. Make sure to let your vendor of choice know that you'd like to see Phoenix included in their distribution as well, as that's what will make it happen.

James Taylor is an architect at in the Big Data Group. He founded the Apache Phoenix project and leads its on-going development efforts. Prior to Salesforce, James worked at BEA Systems on projects such as a federated query processing system and a SQL-based complex event programming platform, and has worked in the computer industry for the past 20+ years at various start-ups. He lives with his wife and two daughters in San Francisco.

Read more »

SQL on HBase with Apache Phoenix

Apache Phoenix is a powerful tool for implementing a relational layer on top of NoSQL HBase database. Low latency query model combined with SQL give the tool a strong edge over HBase API or shell commands. To discover more about the tool and find out how it works, HadoopSphere interacted with James Taylor, founder of Apache Phoenix.

How does Apache Phoenix fit in the crowded SQL on Hadoop space?

Apache Phoenix is specifically meant for accessing HBase data. It's not trying to compete with the Hive and Impalas of the world. We do very specific optimizations for HBase to ensure you get the best possible performance.

What are the complexities that Apache Phoenix abstracts away from the user?

Phoenix abstracts away the same complexities that a RDBMS does when you use SQL versus reading raw bytes from files that represent a page of data on disk. Sure, this is a bit of an exaggeration, but not by a lot. The HBase client APIs are very low level, at the level of byte arrays. There's no support for composite row keys in HBase, for example, so your left bit twiddling the start and stop row in your scans. Performance in HBase changes with each version, so in prior versions you'd get a 25% boost by avoiding the use of the ExplicitColumnTracker.

Above and beyond these lower level details are higher level features such as secondary indexes (now supporting functional indexes), user defined functions (available in 4.4), and just the overall capabilities for querying in SQL (with a parallel execution) versus writing lots and lots of Java code (with a serial execution). Try writing a N-way join with aggregation and union in the HBase client API. And then compare the performance. And then change your data model and see which one takes longer to adapt.
Apache Phoenix architecture

Can you tell us about internal architecture of Apache Phoenix? How does it work?

Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Phoenix has a pretty typical query engine architecture with a parser, compiler, planner, optimizer, and execution engine. We push as much computation as possible into the HBase server which provides plenty of hooks for us to leverage. This helps us achieve our stellar performance.

Taking an example, can you help us understand Apache Phoenix's execution plan?

Take a look at the video from my HBaseCon presentation here  to get a good overview (about 15mins in). Other good references can be found here on Apache Phoenix website.

James Taylor is an architect at in the Big Data Group. He founded the Apache Phoenix project and leads its on-going development efforts. Prior to Salesforce, James worked at BEA Systems on projects such as a federated query processing system and a SQL-based complex event programming platform, and has worked in the computer industry for the past 20+ years at various start-ups. He lives with his wife and two daughters in San Francisco.

Read more »