Recent developments...

Top Big Data Influencers of 2015

2015 was an exciting year for big data and hadoop ecosystem. We saw hadoop becoming an essential part of data management strategy of almost all major enterprise organizations. There is cut throat competition among IT vendors now to help realize the vision of data hub, data lake and data warehouse with Hadoop and Spark.

As part of its annual assessment of big data and hadoop ecosystem, HadoopSphere publishes a list of top big data influencers each year. The list is derived based on a scientific methodology which involves assessing various parameters in each category of influencers. HadoopSphere Top Big Data Influencers list reflects the people, products, organizations and portals that exercised the most influence on big data and ecosystem in a particular year. The influencers have been listed in the following categories:

  • Analysts
  • Social Media
  • Online Media
  • Products
  • Techies
  • Coach
  • Thought Leaders


Doug Henschen It might have been hard to miss Doug Henschen writing for InformationWeek. With his accomplished media experience and proven expertise in industry analysis, Doug has now joined Wang at Constellation Research talking about big data. His current focus areas include good data, streaming, cloud solutions and self-service of data.

Merv Adrian The saner voice on big data in the important research firm Gartner, Merv Adrian makes sure we make sense out of the dichotomy between data warehouse and data lake. He understands the breadth and depth of Hadoop ecosystem and provides the vision to cross the hype.

Tony Baer When you talk to Tony Baer, don’t expect rebel thoughts just plain incisive wisdom unravelling with each statement. More prose looking like poetry, the analysis casts an indelible effect on your understanding of the big data ecosystem. He remains top of the Hadoop analyst game for many years in a row now.

Social Media:

Bernard Marr Bernard Marr is an author, speaker and consultant with wider interests in strategic performance, analytics, KPIs and big data.  He is the founder of Advanced Performance Institute and provides consulting to various organizations. Bernard has a massive following on Twitter and his LinkedIn posts' generate huge interaction and interest.

Cloudera Cloudera is the market leader in Hadoop distros and at the same time continues to influence social media followers. It may not have the most number of followers compared to other companies but most of it’s messages gets the right amplification and impact. Kudos to Cloudera social marketing team.

Gregory Piatetsky-Shapiro As the President of KDnuggets, Gregory is a founder of KDD (Knowledge Discovery and Data mining conferences). His social media messages attract the right amount of traffic and eye balls making him one of the most relevant social media influencers.

Online Media:

O’Reilly Media O’Reilly Media is a diversified group now with interests ranging from books to blogs, webinars to conferences. With Strata Hadoop World as one of its most visible product now after books, O’Reilly media is definitely shaping up the big data opinion in the industry.

TDWI With research papers, blogs, webinars and education events, TDWI continues to attract impressions and leads for marketers.

The Cube The Cube is a pioneering online television series filmed at various industry events. It brings the best minds on the show speaking up the future of big data. Chic image setting television, it boasts of the CxO speakers like no other forum can.


Actian Vortex Actian Vortex is one real sharp SQL in Hadoop product which brings the best of database SQL to Hadoop and YARN world. With innovative engineering under the hood to support ACID transactions and higher performance, it has motivated quite a few solutions in its arena.

Apache Flink Apache Flink started off a research product and soon created a unique identity for its streaming capabilities. It has influenced quite a few features in other competing streaming products like off heap memory management, datasets and the like.

Kyvos Insights Kyvos Insights is an OLAP product building cubes at big data scale while assuring low latency SLA on Hadoop. With pre-canned cubes, interactive queries on terabytes of data within 2 seconds is a real possibility and an eye catcher. As the trendsetter for cubes on Hadoop, it has inspired a few other imitations on its trail but none at par so far.


Reynold Xin As one of the co-founders of Databricks and Apache Spark, Reynold Xin continues to influence major innovations in Spark including Tungsten memory management, Dataframes and many more. Sharp and futuristic, he is a real tech force.

Roman Shaposhnik With the Open Data Platform (ODP), Roman Shaposhnik has got a new home for corporate Hadoop and continues to lead the initiative magnificently. Pushing many other Apache projects alongside like BigTop and acting as mentors to others like Ignite, Roman emerged as a true tech leader in last year.

Todd Lipcon When Todd brought HA to Hadoop, he brought Hadoop to the enterprise infrastructure. When Todd Lipcon has brought Kudu to Hadoop, he has brought Hadoop to the enterprise database. Believe it or not, but Todd has unassumingly and unwittingly become the enterprise champion for Hadoop.


Paco Nathan If you are looking for a Spark session in an industry event or on online resources, chances are you have may have attended one of Paco Nathan’s session. Evil mad scientist as he likes to proclaim himself, he is much more than Spark and lot of data science, maths, venture capital and learning coach among his many-many interests.

Shane Curcuru Community over code and Apache open source over corporate proprietary, Shane Curcuru has been evangelizing Apache for years now. As one of ASF directors, he ensures Apache brand name is taken care of in right measure and the community driven projects get their right share of sun.

Thought Leaders:

Ion Stoica As one of the main founders of Apache Spark, Ion Stoica already has rallied the entire data world around one product. However, his vison with Databricks does not seem to be confined to just a batch execution engine. It seems Databricks is out there to get a bigger share of data center with its cloud offerings and the innovations continue rolling in at an unprecedented velocity.

Mike Olson As the Chief Strategy Officer and Chairperson of Cloudera, Mike Olson has made sure Cloudera remains at the top of the Hadoop game. Resisting off the market IPO or acquisition bait and maintaining the innovation path, he has been keeping Cloudera a steady ship.  Open to disruptions like Spark and embracing partners, he has been one true leader who thinks and acts with vision and authority.

<< HadoopSphere Top Big Data Influencers of 2014 
Read more »

Deep dive into Actian Vortex architecture

The innovations continue at a rapid pace in SQL on Hadoop solutions with each vendor trying to outsmart the competition. In this second part of interview with Actian’s Emma McGrattan, we try to understand architecture of Actian Vortex’s SQL in Hadoop offering with particular focus on database/SQL layer named Vector. Emma is the Senior Vice President for Engineering at Actian and described the "Marchitecture" (as she likes to term it) in a conversation with Sachin Ghai. As per Emma, Actian Vortex product suite is among the fastest and most mature SQL 'in' Hadoop offering. 

Actian Engineering has definitely put a lot of thought and innovation in the Vortex architecture. It is one of those products where the engineering team exactly knew the nuts and bolts of Hadoop as well as the cranks and shafts of database. It is rare currently to find an SQL offering which relies on HDFS as storage but still achieves enterprise grade resonant with the database category. Utilizing YARN more as an enabler and replication as an added blessing, Actian Vortex has made it’s Vector database off-Hadoop to seamlessly transition to Hadoop. With more production implementations in future months, we can expect hardware disk requirements to get further optimized and in-memory features to take center stage. 
Read more to understand its technical and functional architecture.

You mentioned about HDFS being one of bottlenecks for performance but what about YARN. Is YARN a bottleneck or an enabler for SQL on Hadoop solutions?

It's funny because it's both an enabler and can actually be a real challenge. Where it becomes a real challenge is the fact that YARN is not designed for long running tasks - it's designed for MapReduce jobs which are typically very short in duration. Typically for a database you expect to start the database and it should just keep running until you need to stop it. It's a long running task in that one would expect for a full quarter without ever coming down for maintenance and that's a challenge for YARN. We do some things within our products so that when we're using YARN, we can allow for the fact database is a long running task and be able to say I want to allocate more or less resources to the database. May be you are doing month end or quarter end processing and you want to give the database some additional cores from environment. You want to add in additional nodes that will speed up processing and then you want to remove those additional resources that you granted it when the quarter or the month is over. That's a challenge for YARN - so we needed to overcome that challenge because what we see is traditionally when you're dealing with EDW, you have periods of huge activity. What you want to do then is to give database as much resources you can spare and then take back afterwards. 

We had to figure out how to do that with YARN and then the other thing that we need to figure out is how we drop priority of the database. It may be you've got the ETL task that you want to give priority to - you may want ETL processing to get the most of the resources but once the ETL processing is done, you want the database to get to full strength. That was another thing that we figured out with YARN.  We were able to use YARN to get dynamic resource scheduling in terms of adding more resources and taking them away in a dynamic or elastic fashion. Typically when you're dealing with database, you are given a certain amount of memory that you're going to allow it to use over its lifetime and very rarely you want to shrink it. We were able to use YARN to figure out how to do that.

We also got the benefit of using YARN for some things like intelligent block replica placement. By default YARN will replicate the blocks that you write to HDFS and that's fantastic for database because you have that built-in failover capability. If you lose a disk there's another copy of the data after that you can cover from. But we were able to do was to tell YARN when we're writing block replicas that we want to tell where they are placed. If you're joining two tables and the tables are partitioned in such a way that the rows that you're joining aren't going to be co-located, we tell YARN we wanted to locate one of those replicas with the table that we are going to be joining it to. That way we were able to avoid any internal communication when doing complex query joins. YARN was hugely beneficial to do something like that. So yeah it's a double edged sword - it presents challenges for enterprise data warehouse but it also provides some capabilities that would be difficult if not impossible to implement and get those for free.

Taking example of join operation, can you give us insights into Vortex architecture?

As part of the platform we have built, we also have a visual workbench to figure out the data flow to take data into the Hadoop environment. It also gives a lot of capabilities around analytics. It is a simple point and click workbench that you drag some nodes on the screen and can build rich data science applications. The box number 1 on the figure is the visual workbench. The first thing we are doing here is reading data from Hadoop, maybe it's log file or smart meter data or something like that. If we follow this read it connects to our high-performance ingest, prep and load that's running on the name node in the cluster. That parallelizes read of source data - so we have these 4 green cylinders on the data nodes and that's just a data file that we're reading using data flow - read all parts in parallel and read that up into the name node in the cluster. Maybe at this point we may want to do some analytics and data science. We may want to do some ingest, prep and load as well and that's the scenario we're going to talk now because that's where our SQL on Hadoop capability comes in again. 

So we've read the data in parallel using our data flow product and now decide that we want to load data in SQL in Hadoop (Vector) technology. We follow the number 2 we're connected to the database master node. We have a master worker architecture and the master node is responsible for preparing the query execution plan and then dividing up the workload to all of the data nodes that are participating in solving the query. The database that's running on the master node generates the query execution plan and if you follow the blue line on the diagram that will then connect to what we call the X100 server. We call it X 100 because we believe it to be a hundred times faster than anything else out there and that's been heart of SQL on Hadoop solution that runs on data nodes. This X100 server will write data to disk so we look at the Blue cylinders here and we are right to blue blocks on disk. We can run that X 100 server on a subset of the nodes and when you install the product you decide which data nodes are going to be participating in your workers set. I mentioned already that you can grow and shrink this depending upon your business requirement. 

In this case we're writing with data out to 4 nodes and so we have got here 4 blue cylinders that are used to illustrate 4 partitions we are writing to. You will see written underneath here then is the block replica and these grey cylinders represent the fact that these four blocks are going to be replicated three times. For simplicity sake we want to join all of the data that is in Data Vector A with the Data Vector B. Rather than having to pull all of the data from Data Vector B over to node where the Data Vector A lives, we actually have a replica of that block that is local to us. That replica of B is right there and we are able to join with the data in there. HDFS has already guaranteed that's an exact replica of the data that is held in Vector B and we are able to join with and remove the need to communicate between the nodes. 

We are talking of big data and each of these tables could represent billions or hundreds of billions of rows of data. We want to avoid having to move data around the cluster if at all possible. Using the block replicas to perform these joins is something that really yields great performance results. When we showed you earlier that slide where we compared performance with Impala, that's one of the reasons. If you look again to slide for queries that are over 30 times faster than Impala, and Impala spends a lot of time data moving data around between nodes. We can out-perform them because of other architectural benefits that we have but not having to move data around for complex joins is something that is bringing benefits in complex queries. 

When you create the tables and you define your primary key foreign key relationships between the table, we use that as a hint as to partitioning the data and building up the location of the block replicas - for figuring out where we want to locate those replica so that we can benefit from a performance perspective. And the other thing that influences placement for the block replica is making sure that should we lose a node in the environment, our performance doesn't drop off the cliff. The loss of one node will impact performance because you're dealing with 3 nodes instead of 4 and you expect performance to be impacted in that scenario by 25 percent. You don't want all joins that you do subsequently to be remote joins and that comes in play also to determine how you place replicas. That's how we perform the joins.

If we look to architecture diagram and that 4th number that's there in Black is the standard SQL applications on BI tools that can just connect directly to the master node and you don't need to make any changes to the those applications. Without changing a line of code we can provide a scalable solution. We have customers that were previously on our Vector solution and they've moved now to Vector in Hadoop and they are seeing better than linear scalability. For our customers not having to make any changes to application and get a linear scalability is very exciting. 

Is metadata for Vortex available to any other agents outside Vortex?

Today, you can access metadata through JDBC/ODBC and some of the APIs but it is not available as part of HCatalog. In the release which is coming up in the first quarter of 2016, we will actually be using HCatalog for storing our metadata. That will be externally accessible by more than just the standard database API. As part of that release, we will also be adding external table support and that will enable you to take any other data source and just register it as a table within the database. We will be able to join data that's held in the Actian SQL in Hadoop solution with any other data in the environment. Let's say you want to join our data with parquet data, you would need to actually load the parquet data in the environment using the Data Flow solution. But in the release that we have coming out in the first quarter of next year you can leave that data in Parquet. We prefer that you import the data into our solution because performance is a lot greater but what we do recognize a lot of people have standardized on parquet for data storage. (So we will provide) the ability to register those parquet file as tables within the database and then just treat them as Vector table. 

How does Vortex support window functions and subqueries?

We have full sub querying and window function support - we have cube, roll-up, grouping set - all of the advanced analytics capabilities that are called out in the SQL language spec and have been incorporated within the product. They have been there for a number of years.
The Vector on Hadoop technology began life as a product Vector off Hadoop almost a decade ago and has been around for some time. It is an analytics engine but you know people take off the shelf tools like Cognos, MicroStrategy, Informatica, Tableau and so on and run it against the database. They have requirements like window framing, multiple window functions, grouping sets and so on - all of those are built into the product. The SQL language committee does not call out a specific analytic language set per se.  So it is not like there's you can say "Oh! look, I'm SQL 2015 analytic compliant" because it's not the way the language is defined. But we do have all the analytics capabilities that are described by both the ANSI and the ISO SQL committees.

Is Vortex truly open source or is it just the API that is open?

Vortex is not open source but we are in the process of publishing a set of libraries that will enable other products to access data that's been written to disk by Vector. They don't need to go through the Vector engine to access the data. Projects that could benefit from storing data in Vector format can use that but also products that want to read data directly without having to go through Vector can do so. This is something that was raised by a couple of our early prospects where they didn't want to be tied into a proprietary solution on Hadoop. We did not want to open source Vector on Hadoop because what we do is quite unique. There's a lot of IP in there that we don't want to disclose at this point in time. We do want people to be able to store data in our format and read and write data to our file format. We're publishing a set of API that will enable people to do that without buying or using the product. 

Read more »

SQL on Hadoop landscape with Actian's Emma McGrattan

The SQL on Hadoop capability is turning out to be a real game changer for data warehouse and Hadoop vendors. To get a sense of what is happening in this intriguing space, Sachin Ghai talked to Emma McGrattan about SQL on Hadoop solutions. Emma is the Senior Vice President for Engineering at Actian and has the responsibility for analytics platform. In this 2 part interview published on HadoopSphere, Sachin asked Emma questions about the generic SQL on Hadoop ecosystem and got further technical insights into market space and technical architecture. Read more to understand about SQL on Hadoop solutions (or SQL 'in' Hadoop as Actian likes to term it).

What is the state of Hadoop ecosystem at this time? Where does it seem to be heading particularly with regards to SQL on Hadoop solutions?

If I could point you to SQL in Hadoop landscape (graphical image), that is a good slide to use to explain where we see landscape today and what we see as the future plans for some of the competitors in the space and how we position ourselves here. First off, if you look at the slide at the axis we use, the Y axis here is the product maturity. People have expectations when they are running SQL solutions and these are expectations that they've built up over the past decades because of experience with traditional enterprise data warehouse technologies and traditional relational database technologies. And so when you're dealing with the SQL language, there is an assumption that there's going to be the full language support and you're going to have the ability to run transactions.

SQL in Hadoop landscape as per Actian

Let's say take a very simple transaction such as transferring $100 for my checking account to my savings account I don't want to debit the checking account unless I know the savings account can be credited by that $100. So the idea of a transaction is something that's fundamental to people using databases and the ability to roll those back. Let's say I have made a mistake and I can't actually update my savings account, I will be able to roll back that transaction. These are ideas that people have that a database should provide and the SQL in Hadoop players because of their lack of maturity they don't have all of these features. 
On the maturity axis here we're looking at things like the language support capabilities and the security features that they provide their performance. So we look at this first circle that we've got on the graph here. We've got the traditional enterprise data warehouse players - so this is Teradata, Oracle, SQL Server 2012 and its Vertica. What these guys are doing they're trying to protect the revenue that they get from traditional enterprise data warehouses. These are public companies and have shareholders whose interests need to be protected. So they need to protect their revenue streams and what these guys are typically doing is they're continuing to push the enterprise data warehouse for the analytics needs. What they're doing is recognizing the fact that people are storing new data types and new data sources in Hadoop. So they provide connectors through to the Hadoop environment and which is why in the red title for this box, I have called them “connections”. What they do is they connect to data in Hadoop and they pull that data from Hadoop into the traditional enterprise data warehouse and they allow you to operate on it. So these guys are really not in Hadoop and they provide connections through to Hadoop data source and I believe that they provide a lot of the capabilities that one will expect in enterprise data warehouse. 

The next circle on here is “wrapped legacy” circle. These companies have taken a traditional relational database technology from the open source world - typically it's a Postgres although in the case of Splice Machine it's Apache Derby. What they've done is they're using the SQL capability of these and all of the capabilities that these legacy database solutions provide and then they are converting SQL statements in MapReduce jobs or finding some way of executing those queries on data resident in Hadoop. In terms of their maturity, they have a lot of capabilities because they've taken mature database technologies. But some of the capabilities that one would expect in terms of performance and scalability - they don't have that because under the underlying capabilities that they take from Hadoop such as MapReduce, just don't perform. They inherit capabilities that lack performance from underlying Hadoop capabilities. 

And third circle in here this green circle table what we label as “from scratch” players and this is the likes of Cloudera Impala, Hortonworks Hive or the IBM Big Insights. What these guys have done is they have built a solution from scratch specifically for Hadoop. They're not cannibalizing any other technologies. They started out with a clean sheet and they're looking to build capabilities that are native to Hadoop. So in terms of the X-axis here, they are very well integrated with Hadoop. But because they're being built from scratch, they're lacking all of the material that one expects from SQL on Hadoop solutions. Security is something that they're kind of bolting on up to the fact; performances is very poor and then they are missing some capabilities like the ability to perform updates. So if you want to have an individual record within a dataset updated these guys have really struggled how best to do that. For those solutions that have figured out how to do updates, they sacrifice performance on the rest of the queries - so that's something that you really lacking in terms of database maturity. But they're very tightly integrated with Hadoop. 

What we see as unique about what we do in our own separate circle on the graph is that we've got all of the capabilities that one expects in the traditional enterprise data warehouse. We've got the performance that one expects a traditional enterprise data warehouse and in fact we can outperform the traditional enterprise data warehouse even though we are on Hadoop. We're also very tightly integrated with Hadoop; we use HDFS for file storage and we use YARN for scheduling and failover capabilities and so on. We're very integrated with the Hadoop system but we are able to provide the capabilities that one often has to sacrifice when looking at a SQL on Hadoop solution. Specifically, security is something that a lot of these SQL on Hadoop solutions are lacking and specifically around the ability to update individual records - the capability that's known in the database word as 'trickle updates'. Bulk update is something that everybody can do but the ability to update individual records something that's difficult in Hadoop. The reason it's difficult is that the HDFS file system was designed so that you write data to the file system once and you only ever read it after that. HDFS as a file system wasn't designed for people to be frequently updating the records in there. It just wasn't the design point that they started with and that's why Cloudera is now looking at Kudu. So that they're looking at a completely different implementation because they just can't get HDFS to provide the capabilities that they need. 

Now we at Actian have figured out how to do this and we have a patent pending on the capability to do it. Essentially we keep track of changes and we keep track of the positions of data that's changed in the environment in the memory overlay structure. When you reference data that has been updated, we reference this overlay structure and we are able to figure out if the data that you need has changed. As you query the data, you will get the updated value of the result that is passed back to you. Now I can’t go into too much detail as to exactly how we've done this because I say a patent is pending and we do believe when we look at what Cloudera is doing in Kudu, they have borrowed a number of the ideas - because up until now we have been talking about this publicly and the Cloudera guys have learned a lot from development work and research work that we have done at Actian. 

And that's how we see the landscape - provide all of the capabilities on the maturity that one expects from the traditional enterprise data warehouse. We are also very tightly integrated with Hadoop using all of the Hadoop projects that can benefit an enterprise data warehouse like the file system and like YARN resource negotiation.

One of the key trends that is emerging is that most of the SQL on Hadoop solutions are beginning to support ACID transactions and in fact Hive has started supporting ACID transactions in 0.14 release. But most of these solutions are still of OLAP nature. Do you see these solutions moving towards OLTP support or is that still a pipe dream?

Splice machine today supports OLTP and that's their target right now. They don't do analytics so much as they do OLTP and when they talk about their performance they used the TPCC benchmark which is a well-known OLTP benchmark to demonstrate their capabilities. I think the challenge that all of the players have in supporting the OLTP workload is the fact that the underlying filesystem just was not designed for it. There's a couple of things that get in the way: obviously if it's not designed for supporting random updates to the data and then taking a very heavy update intensive workload to file system is going to be a nightmare for disaster. That requires capabilities like we've had to implement whereby you actually keep track of all of the changes in memory and you have a Write Ahead Log file because you need that consistency that ACID guarantees. Should you lose power in your data center and the cluster disappears you want to make sure that  any of the transactions that were committed at that point in time can be recovered during the cold start recovery process. I think projects like Kudu will help but I think there's a fundamental problem with the underlying distributed file system in Hadoop that is causing a real challenge for bringing OLTP workloads to Hadoop.

Another key question which most customers come up is that how difficult or easy is it for them to migrate the legacy SQL and ETL workloads they were running in the traditional warehouse to the big data warehouse. What has been your experience in getting those SQL workloads migrated to the big data clusters?

When I talk about the guys that are lower on the SQL maturity scale, two things that they're missing. One is the full SQL language support so they will have a very limited subset of SQL language implemented and that is a problem when taking existing workloads to Hadoop. The reason for that is that traditional workloads most times use off-the-shelf tools for generating reports and dashboards and so on. You don't have control over exactly what syntax is going to be produced by those tools - so that you cannot then work around deficiencies in their SQL implementations. That's the first challenge they face and they're trying to round out their SQL language support. The 'from scratch' and 'the wrapped legacy' players don't have a complete language implementation. For the 'from scratch' guys, it's just because they haven't had the time to complete it yet and for 'the wrapped legacy' guys it's because the underlying capabilities that they're depending upon don't provide everything. So you know it may be that they're taking SQL and then converting to MapReduce jobs - so they don't get the full capabilities that one needs for supporting these traditional workloads. 

The second challenge that these 'from scratch' players is that their query optimizers are very immature. If you look at Hive or Cloudera and look at how they optimize queries. They don't have the smarts to figure out what's the best order in which the tables were you referencing multiple tables within a query. To work around that limitation, what they do is that they require that you list the tables in the order which you want them joined. As you write the SQL as an application developer, you need to know which is going to be the most efficient join order of the tables and you need to code that in query. As I said earlier, traditional workloads used tools like Business Objects, Tableau and any visualization technology. You don't have control over SQL that they're generating and so you don't have control over things like join orders. That's a real challenge and when we look at the 'from scratch' players and the 'wrapped legacy' players they really struggle to take traditional enterprise data warehouse workloads to Hadoop using those technologies. 

With the Actian solution, we have the complete SQL language implementation and we also have invested many decades in optimization technologies. What we were able to do is to take our Cost Based Query Optimizer to Hadoop and modify it for some of the unique scenario that one encounters in Hadoop. We can outperform the competition on Hadoop by quite a margin if you look (at the performance benchmark diagram). You'll see here is that when we compare our performance with that of Impala and if you read this graph correctly it's a little confusing. What we did was we took a workload that the Impala team used to demonstrate their superiority over Hive. They ran these queries and said that they were 100 times faster over Hive. What we did was we ran them against the Actian SQL in Hadoop and we show for instance for first query, we are 30 times faster than Impala.

We can outperform these guys by quite a margin because we know how to solve complex queries and we also can take a standard EDW workload and deploy it on Hadoop without having to change anything. We have some very large customers and that were previously off Hadoop with our Vector technology. We moved them to Vector on Hadoop and without changing anything except the connection string where they pointed to the database they were able to run their enterprise data warehouse workload against SQL in Hadoop solution. Being able to do that for a customer is very exciting, it's very empowering and for us it means certifying the latest versions of Cognos or Informatica or whatever it may be.  We certify them as part of our standard QA process and we know that any customer that's using any official tool for EDW today can move their workload to our SQL in Hadoop solution without having to compromise performance or functionality or security.

That's pretty interesting and these benchmarks are often talking points since they hide more than they show.

That is a good thing that you raised...You'll see that the SQL that we use in Vector is a standard SQL. These benchmarks typically what they are measuring is the performance at the database engine - so they usually will restrict to 100 rows that come back from the query. Those top 100 rows from three tables that we're going to join and then we've got some join conditions here and some group by and order by. If you look at... the Impala equivalent in their benchmark capability, what they actually do is they specify the partition keys that are used for solving this query. They have completely cheated in this benchmark. They know exactly where on disk the rows that are gonna satisfy the query lie and they point exactly to those rows. But even with them playing that trick, we will outperform them by thirty times. 

The benchmarks are not something that I would trust unless they came from a party like TPC and they are audited benchmarks. In this particular case that we call as the Impala subset of TPC-DS they identified the queries where they were going to out-perform Hive, they rewrote them because they have limited SQL support and then they added the partition keys. So it's a game - they cheated and I call them on their cheating when I use this presentation publicly because I think it's wrong. 

Everybody is playing games with these benchmarks and we can use them to demonstrate whatever you chose. But for us we're using standard SQL. We have clean hands in this and we also plan something for the first half of next year to publish some audited benchmark results. We're not running the TPC-DS because that benchmark is under review with TPC Council right now and it looks like they're going to deprecate it replacing it with something for Hadoop. We are going to take more traditional workload TPC-H benchmark and we are going to publish numbers in first half of next year and these will be audited benchmark numbers for TPC-H benchmark on Hadoop. We will be the first database vendor to publish audited TPC benchmarks for Hadoop. We are very excited by this because our performance is going to be, I hope, an order of magnitude greater than traditional enterprise data warehouse players. We expect to beat Teradata, Oracle, SQL Server and Vertica by an order of magnitude which is huge when you are dealing with Hadoop which is known as commodity workhorse platform but is not known for performance.

Read more »

Apache Spark turning into API haven

Apache Spark is the hottest technology around in big data. It has the most generous contributions from the open source community. But with the latest release of Apache Spark 1.6, there is clearly a pattern evolving in where it is heading. And, currently that road seems to be an API haven. (Note, the term being used here is haven and not heaven).

Spark innovations in 2015

March 2015: Spark 1.3 released.

DataFrames introduced: SchemaRDD renamed and further innovated to give rise to DataFrames. DataFrames are not just RDDs with schema but have a huge army of useful operations that could be invoked with an exhaustive API.
For some strange reason, DataFrames were decided to be more of relational nature only and so were pitched directly along with Spark SQL. A developer could use either DataFrame API or could use SQL to query relational form data which could be residing in tables or any Spark supported data source (like Parquet, JSON etc.).

Nov 2015: Spark 1.6 released.

Datasets introduced: specialized DataFrames which can operate on specific JVM object type and not just rows. Essentially Dataset uses a logical plan created by Catalyst, the algorithmic engine behind Spark SQL/DataFrame and thus can do a lot of logical operations like sorting and shuffling. In a future release, you can expect DataFrames API to change and extend Dataset. 
Nutshell, Datasets are more intelligent objects compared to vanilla RDDs and if you might have guessed it, could be the future pinning of Spark API. 
Is it time to say "bye bye RDD, welcome Dataset"?

Happy or Upset with new Spark release?

Are you sounding relieved or almost upset with new Spark release– could depend on what stage are you in Spark journey.
For an indicative sample, answer could depend on what you doing.

You could be upset if:
- You have already been using Spark API and more or less re-settled yourself on Spark 1.3.x being distributed with CDH, HDP and MapR.
- You were wishing for a more powerful and SQL-rich Spark SQL but instead see focus shifting from SQL to programming API.
- You were wishing for a tighter integration between SQL and ML rather than DataFrames API and ML Pipelines API.

You could be delighted if:
- You are a devout Cloudera Impala or Hive fan and loved their performance! You always wanted to stick with these rather than Spark SQL offering more performant in-memory analytical power.
- You were a Storm and Mahout fan and were looking for ways to avoid shifting to Spark! You now know that coding in Spark may not be that cake-walk as it was promised to be since there are frequent API changes and over-reliance on experimental API for exposing functions in the promised unified platform.
- You were building your stealth next top big data product and were suddenly delighted to figure out that it’s difficult for users to remain settled in competing Spark! What was relevant in Spark around 6 months back may require re-factoring/re-engineering now.

Overall, the request to the brilliant and brave Spark core committers is to have a rethink and start loving the world outside programming API. Spark has been given the official throne of big data execution engine. So, it now needs to settle out on the surface while it keeps to continuously innovate under the covers. Spark is no longer ‘experimental’ and the term needs to move out of its documentation and strategy. It has to be hardened, enterprise grade and long serving to the end user. That's the only way for Spark to keep shining.

Read more »