
Exploring varied data stores with Apache MetaModel

With relational, NoSQL and unstructured data stores all in wide use, it is natural to look for a library that can act as a common exploration and querying connector. Apache MetaModel is one such library: it aims to provide a common interface for discovery, metadata exploration and querying of different types of data sources. With MetaModel, the user can currently query CouchDB, MongoDB, HBase, Cassandra, MySQL, Oracle DB and SQL Server, among others.

"MetaModel is a library that encapsulates the differences and enhances the capabilities of different data stores. Rich querying abilities are offered to data stores that do not otherwise support advanced querying and a unified view of the data store structure is offered through a single model of the schema, tables, columns and relationships."


HadoopSphere discussed with Kasper Sorensen, VP of Apache MetaModel, the functional fit and features of Apache MetaModel. Here is what Kasper had to say about this interesting product.

Please help us understand the purpose of Apache MetaModel and the use cases it can fit into.


MetaModel was designed to make connectivity across very different types of databases, data file formats and the like possible with just one uniform approach. We wanted an API where you can explore and query a data source using exactly the same codebase regardless of whether the source is an Excel spreadsheet, a relational database or something completely different.

There are quite a lot of frameworks out there that do this. But they more or less require you to map your source to some domain model, and then you end up actually querying the domain model, not the data source itself. So if your user is the person who is interested in the Excel sheet or the database table itself, then he cannot directly relate his data to what he is getting from the framework. This is why we used the name ‘MetaModel’ – we present data based on a metadata model of the source, not based on a domain model mapping approach.

How does Apache MetaModel work?


It is important for us that MetaModel has very few infrastructure requirements - so you don't have to use any particular container type or dependency injection framework. MetaModel is plain Java oriented and if you want to use it you just instantiate objects, call methods etc.

Whenever you want to interact with data in MetaModel you need an object that implements the DataContext interface. A DataContext is a bit like a “Connection” to a database. The DataContext exposes methods to explore the metadata (schemas, tables etc.) as well as to query the actual data. The Query API is preferably type-safe Java, linked to the metadata objects, but you can also express your query as SQL if you like.
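To make that concrete, here is a minimal sketch of the explore-then-query flow Kasper describes, assuming a MySQL database with an "orders" table holding "customer" and "amount" columns (the connection details, table and column names are illustrative, not taken from the interview):

```java
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Table;

public class ExploreAndQuery {
    public static void main(String[] args) throws Exception {
        // Any JDBC source works; URL and credentials here are placeholders.
        Connection connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sales", "user", "secret");

        // The DataContext is the common entry point, regardless of source type.
        DataContext dataContext = DataContextFactory.createJdbcDataContext(connection);

        // Explore metadata: schemas, tables and columns.
        Schema schema = dataContext.getDefaultSchema();
        for (Table table : schema.getTables()) {
            System.out.print(table.getName() + ":");
            for (Column column : table.getColumns()) {
                System.out.print(" " + column.getName());
            }
            System.out.println();
        }

        // Type-safe query, built from the same metadata objects.
        Table orders = schema.getTableByName("orders");
        DataSet dataSet = dataContext.query()
                .from(orders)
                .select("customer").and("amount")
                .where("amount").greaterThan(100)
                .execute();
        while (dataSet.next()) {
            System.out.println(dataSet.getRow());
        }
        dataSet.close();
    }
}
```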

Depending on the type of DataContext, we have a few different ways of working. For SQL databases we of course want to delegate more or less the whole query to the database. On other source types we can only delegate some parts of the query to the underlying store. For instance, if you apply a GROUP BY operation to a NoSQL database, then we usually have to do the grouping part ourselves. For that we have a pluggable query engine. Finally, some source types, such as CSV files or XML documents, do not have a query engine of their own, and we wrap our own query engine around them.
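From the caller's side this delegation is invisible: the same query can be handed to any DataContext, and MetaModel decides per source how much is pushed down and how much its own engine handles in memory. A rough sketch, with made-up table and column names:

```java
import org.apache.metamodel.DataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.data.Row;

public class GroupByAnywhere {

    // Runs unchanged whether 'dataContext' wraps a SQL database (the GROUP BY is
    // delegated to the database) or a CSV/NoSQL source (MetaModel groups in memory).
    public static void printTotalsPerCountry(DataContext dataContext) {
        DataSet dataSet = dataContext.executeQuery(
                "SELECT country, SUM(amount) FROM orders GROUP BY country");
        while (dataSet.next()) {
            Row row = dataSet.getRow();
            System.out.println(row.getValue(0) + " -> " + row.getValue(1));
        }
        dataSet.close();
    }
}
```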

Some sources can also be updated. We offer a functional interface where you pass an UpdateScript object that performs the required updates, where the underlying source allows it and with whatever transactional features it may or may not support.
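A minimal sketch of that pattern, assuming a local customers.csv file with "name" and "country" columns (file and column names are illustrative):

```java
import java.io.File;

import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.UpdateCallback;
import org.apache.metamodel.UpdateScript;
import org.apache.metamodel.UpdateableDataContext;
import org.apache.metamodel.schema.Table;

public class UpdateExample {
    public static void main(String[] args) {
        // CSV files are one of the updateable source types.
        UpdateableDataContext dataContext =
                DataContextFactory.createCsvDataContext(new File("customers.csv"));
        final Table table = dataContext.getDefaultSchema().getTable(0);

        // The whole script is handed to executeUpdate(); how atomic it is depends
        // on what the underlying store can offer.
        dataContext.executeUpdate(new UpdateScript() {
            @Override
            public void run(UpdateCallback callback) {
                callback.insertInto(table)
                        .value("name", "Jane Doe")
                        .value("country", "DK")
                        .execute();
            }
        });
    }
}
```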

Give us a snapshot of what Apache MetaModel is competing with - both in the open source and commercial ecosystems.


I don't think there's anything out there that is really a LOT like MetaModel. But there are obviously some typical frameworks that have a hint of the same.

JPA, Hibernate and so on are similar in the way that they are essentially abstracting away the underlying storage technology. But they are very different in the sense that they are modelled around a domain model, not the data source itself.

LINQ (for .NET) has a lot of similarities with MetaModel. Obviously the platform is different, though, and the syntax of LINQ is superior to anything we can achieve as "just" a library. On the plus side for MetaModel, I believe we have one of the easiest interfaces to implement if you want to make your own adaptor.

What lies ahead on the roadmap for Apache MetaModel in 2015?


We are in a period of taking small steps so that we get a feel for what the community wants. For example, we just made a release where we added write/update support for our ElasticSearch module.

So the long-term roadmap is not really set in stone. But we do always want to expand the portfolio of supported data stores. I personally also would like to see MetaModel used in a few other Apache projects so maybe we need to work outside of our own community, engaging with others as well.

Why would a user explore metadata using Apache MetaModel and not connect to various data stores directly?


If you only need to connect to one data store, which already has a query engine and all, then you don't have to use MetaModel. A key strength of MetaModel is the uniform access to multiple data stores, and similarly a weakness is in utilizing all the functionality of a single source. We do have great metrics overall, but if you're trying to optimize the use of just one source, then you can typically get better results by going directly to it.

Another reason might be to get a query API for things such as CSV files, spreadsheets and so on, which normally have no query capabilities. MetaModel will provide you with a convenient shortcut there.

Testability is also a strong point of MetaModel. You may write code to interact with your database, but simply test it using a POJO data store (an in-memory Java collection structure), which is fast and lightweight.
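A hedged sketch of what such a test setup can look like, using the POJO module; the class names used here (PojoDataContext, ArrayTableDataProvider, SimpleTableDef) and the test data are illustrative assumptions rather than details from the interview:

```java
import java.util.ArrayList;
import java.util.Collection;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.pojo.ArrayTableDataProvider;
import org.apache.metamodel.pojo.PojoDataContext;
import org.apache.metamodel.util.SimpleTableDef;

public class PojoTestExample {
    public static void main(String[] args) {
        // In-memory "customers" table, standing in for the real database in tests.
        Collection<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] { "Jane Doe", "DK" });
        rows.add(new Object[] { "John Smith", "US" });

        SimpleTableDef tableDef =
                new SimpleTableDef("customers", new String[] { "name", "country" });
        DataContext dataContext =
                new PojoDataContext("testschema", new ArrayTableDataProvider(tableDef, rows));

        // The production query code runs unchanged against the POJO-backed context.
        DataSet dataSet = dataContext.query()
                .from("customers")
                .select("name")
                .where("country").eq("DK")
                .execute();
        while (dataSet.next()) {
            System.out.println(dataSet.getRow());
        }
        dataSet.close();
    }
}
```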

It seems Apache MetaModel does not support HDFS though it supports HBase. Any specific reason for that?


Not really, except that it hasn't been requested by the community yet. But we do have interfaces that quite easily let you use e.g. the CsvDataContext on a resource (file) in HDFS. In fact, a colleague of mine did that already for a separate project where a MetaModel-based application was applied to Hadoop.
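Sketched below is what that could look like: the CSV module reads from a Resource abstraction, so pointing it at HDFS is largely a matter of supplying a Resource implementation for HDFS. The HdfsResource class name and the paths are assumptions for illustration; depending on your MetaModel version you may need to write such a Resource yourself on top of Hadoop's FileSystem API:

```java
import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvConfiguration;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.schema.Table;
import org.apache.metamodel.util.HdfsResource;
import org.apache.metamodel.util.Resource;

public class HdfsCsvExample {
    public static void main(String[] args) {
        // Resource pointing at a CSV file in HDFS (namenode address and path are made up).
        Resource resource = new HdfsResource("hdfs://namenode:9000/data/orders.csv");

        // The CSV module only cares about the Resource's bytes, not where they live.
        DataContext dataContext = new CsvDataContext(resource, new CsvConfiguration());

        for (Table table : dataContext.getDefaultSchema().getTables()) {
            System.out.println(table.getName());
        }
    }
}
```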

My guess as to why this is less interesting is that if you're an HDFS user, then you typically have such an amount of data that you don't want a querying framework (like MetaModel) anyway, but rather a processing framework (such as MapReduce) to handle it.

Does Apache MetaModel support polyglot operations with multiple data stores?


Yes, a common thing that our users ask is stuff like “I have a CSV file which contains keys that are also represented in my database – can I join them?” … And with MetaModel you absolutely can. We offer a class called CompositeDataContext which basically lets you query (and explore for that matter) multiple data stores as if they were one.
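To illustrate, here is a rough sketch of joining a CSV file of keys with a database table through a CompositeDataContext; the file name, table names, columns and the exact builder calls are illustrative rather than taken from the interview:

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.metamodel.CompositeDataContext;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Table;

public class CompositeExample {
    public static void main(String[] args) throws Exception {
        // Two unrelated data stores: a CSV file of keys and a JDBC database.
        DataContext csv = DataContextFactory.createCsvDataContext(new File("keys.csv"));
        Connection connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sales", "user", "secret");
        DataContext jdbc = DataContextFactory.createJdbcDataContext(connection);

        // Expose both data stores as if they were one.
        DataContext composite = new CompositeDataContext(csv, jdbc);

        Table keys = composite.getTableByQualifiedLabel("keys.csv");
        Table orders = composite.getTableByQualifiedLabel("orders");

        // Join the CSV keys with the database table through the ordinary query API.
        DataSet dataSet = composite.query()
                .from(keys)
                .innerJoin(orders)
                .on(keys.getColumnByName("id"), orders.getColumnByName("customer_id"))
                .select(keys.getColumnByName("id"), orders.getColumnByName("amount"))
                .execute();
        while (dataSet.next()) {
            System.out.println(dataSet.getRow());
        }
        dataSet.close();
    }
}
```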



Kasper Sorensen is the VP of Apache MetaModel and in his daily life he works as Principal Tech Lead at Human Inference, a Neopost company. The main products in his portfolio are Apache MetaModel and the open source Data Quality toolkit DataCleaner.
