Exploring varied data stores with Apache MetaModel

With the wide prevalence of multiple data stores including relational, NoSQL and unstructured formats, it becomes natural to look for a library which can act as a common exploration and querying connector. Apache MetaModel is one such library which aims to provide a common interface for discovery, metadata exploration and querying of different types of data sources. With MetaModel, the user can currently query CouchDB, MongoDB, HBase, Cassandra, MySQL, Oracle DB and SQL Server among others.

"MetaModel is a library that encapsulates the differences and enhances the capabilities of different data stores. Rich querying abilities are offered to data stores that do not otherwise support advanced querying and a unified view of the data store structure is offered through a single model of the schema, tables, columns and relationships."

HadoopSphere discussed the functional fit and features of Apache MetaModel with Kasper Sorensen, VP of Apache MetaModel. Here is what Kasper had to say about this interesting product.

Please help us understand the purpose of Apache MetaModel and which use cases can Apache MetaModel fit in?

MetaModel was designed to make connectivity across very different types of databases, data file formats and similar sources possible with just one uniform approach. We wanted an API where you can explore and query a data source using exactly the same codebase regardless of the source being an Excel spreadsheet, a relational database or something completely different.

There are quite a lot of frameworks out there that do this. But they more or less have the requirement that you need to map your source to some domain model and then you end up actually querying the domain model, not the data source itself. So if your user is the guy that is interested in the Excel sheet or the database table itself then he cannot directly relate his data with what he is getting from the framework. This is why we used the name ‘MetaModel’ – we present data based on a metadata model of the source, not based on a domain model mapping approach.

How does Apache MetaModel work?

It is important for us that MetaModel have very few infrastructure requirements - so you don’t have to use any particular container type or dependency injection framework. MetaModel is plain Java oriented and if you want to use it you just instantiate objects, call methods etc.

Whenever you want to interact with data in MetaModel you need an object that implements the DataContext interface. A DataContext is a bit like a “Connection” to a database. The DataContext exposes methods to explore the metadata (schemas, tables etc.) as well as to query the actual data. The Query API is preferably type-safe Java, linked to the metadata objects, but you can also express your query as SQL if you like.
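As a sketch of what that looks like in code, here is a minimal example, assuming the MetaModel library is on the classpath. A CSV file stands in for "any data store", and the file name and columns are invented for illustration; the same DataContext calls apply to other supported sources.

```java
import java.io.File;
import java.io.FileWriter;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Table;

public class MetadataExample {
    public static void main(String[] args) throws Exception {
        // Illustrative data file (a real application would point at an
        // existing source instead).
        File file = new File("people.csv");
        FileWriter writer = new FileWriter(file);
        writer.write("name,age\nAlice,34\nBob,29\n");
        writer.close();

        DataContext dc = new CsvDataContext(file);

        // Explore the metadata: every DataContext exposes the same
        // schema/table/column model, regardless of the backing store.
        Table table = dc.getDefaultSchema().getTableByName("people.csv");
        for (Column column : table.getColumns()) {
            System.out.println(column.getName());
        }

        // Query via the type-safe API, linked to the metadata objects.
        Column name = table.getColumnByName("name");
        DataSet ds = dc.query().from(table).select(name).execute();
        while (ds.next()) {
            System.out.println(ds.getRow().getValue(name));
        }
        ds.close();
    }
}
```

The same code would compile unchanged against, say, an Excel or JDBC DataContext; only the construction of the `dc` object differs.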

Depending on the type of DataContext, we have a few different ways of working. For SQL databases we of course want to delegate more or less the whole query to the database. On other source types we can only delegate some parts of the query to the underlying database. For instance, if you apply a GROUP BY operation to a NoSQL database, then we usually have to do the grouping part ourselves. For that we have a pluggable query engine. Finally some source types, such as CSV files or XML documents, do not have a query engine already and we wrap our own query engine around them.

Some sources can also be updated. We offer a functional interface where you pass an UpdateScript object that performs the required update, when the underlying source allows it, and with whatever transactional features it may or may not support.
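A hedged sketch of that update path, again using a CSV file as the updateable source (for CSV, the table name matches the file name; the file and column names here are assumptions):

```java
import java.io.File;

import org.apache.metamodel.UpdateCallback;
import org.apache.metamodel.UpdateScript;
import org.apache.metamodel.UpdateableDataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Table;

public class UpdateExample {
    public static void main(String[] args) {
        // CsvDataContext is one of the UpdateableDataContext implementations.
        final UpdateableDataContext dc =
            new CsvDataContext(new File("contacts.csv"));

        dc.executeUpdate(new UpdateScript() {
            public void run(UpdateCallback callback) {
                // Create the table (for a CSV file this writes the header).
                Table table = callback
                        .createTable(dc.getDefaultSchema(), "contacts.csv")
                        .withColumn("name")
                        .withColumn("email")
                        .execute();
                // Insert a row within the same update script.
                callback.insertInto(table)
                        .value("name", "Alice")
                        .value("email", "alice@example.com")
                        .execute();
            }
        });

        // Read back the row count to confirm the insert.
        DataSet ds = dc.query().from("contacts.csv").selectCount().execute();
        ds.next();
        System.out.println(ds.getRow().getValue(0));
        ds.close();
    }
}
```

Whether the whole script runs atomically depends on the underlying store, as noted above; for a flat file there is no real transaction.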

Give us a snapshot of what is Apache MetaModel competing with - both in open source and commercial ecosystem?

I don’t think there’s anything out there that is really a LOT like MetaModel. But there are obviously some typical frameworks that have a hint of the same.

JPA, Hibernate and so on are similar in the way that they are essentially abstracting away the underlying storage technology. But they are very different in the sense that they are modelled around a domain model, not the data source itself.

LINQ (for .NET) has a lot of similarities with MetaModel. Obviously the platform is different though and the syntax of LINQ is superior to anything we can achieve as being “just” a library. On the plus-side for MetaModel, I believe we have one of the easiest interfaces to implement if you want to make your own adaptor.

What lies ahead on the roadmap for Apache MetaModel in 2015?

We are in a period of taking small steps so that we get a feel of what the community wants. For example we just made a release where we added write/update support for our ElasticSearch module.

So the long-term roadmap is not really set in stone. But we do always want to expand the portfolio of supported data stores. I personally also would like to see MetaModel used in a few other Apache projects so maybe we need to work outside of our own community, engaging with others as well.

Why would a user explore metadata using Apache MetaModel and not connect to various data stores directly?

If you only need to connect to one data store, which already has a query engine and all - then you don’t have to use MetaModel. A key strength of MetaModel is the uniform access to multiple data stores; a corresponding weakness is in utilizing all the functionality of a particular source. We do have great metrics overall, but if you’re trying to optimize the use of just one source then you can typically get better results by going directly to it.

Another reason might be to get a query API for things such as CSV files, spreadsheets and so on, which normally have no query capabilities. MetaModel will provide you with a convenient shortcut there.
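That shortcut could look like the following sketch, where a plain CSV file is filtered as if it had its own query engine (file name and column values are made up for illustration):

```java
import java.io.File;
import java.io.FileWriter;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.data.DataSet;

public class CsvQueryExample {
    public static void main(String[] args) throws Exception {
        File file = new File("customers.csv");
        FileWriter writer = new FileWriter(file);
        writer.write("name,country\nAlice,Denmark\nBob,Germany\n");
        writer.close();

        DataContext dc = new CsvDataContext(file);

        // MetaModel's own query engine evaluates the WHERE clause,
        // since the file format offers no querying of its own.
        DataSet ds = dc.query()
                .from("customers.csv")
                .select("name")
                .where("country").isEquals("Denmark")
                .execute();
        while (ds.next()) {
            System.out.println(ds.getRow().getValue(0));
        }
        ds.close();
    }
}
```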

Also, testability is a strong point of MetaModel. You may write code to interact with your database, but simply test it using a POJO data store (an in-memory Java collection structure), which is fast and lightweight.
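A test could swap in such an in-memory store like this; the schema, table and row values below are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Collection;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.pojo.ArrayTableDataProvider;
import org.apache.metamodel.pojo.PojoDataContext;
import org.apache.metamodel.util.SimpleTableDef;

public class PojoTestExample {
    public static void main(String[] args) {
        // In-memory rows standing in for a real database table in a test.
        Collection<Object[]> rows = new ArrayList<Object[]>();
        rows.add(new Object[] { "Alice", 34 });
        rows.add(new Object[] { "Bob", 29 });

        SimpleTableDef tableDef =
            new SimpleTableDef("person", new String[] { "name", "age" });
        DataContext dc = new PojoDataContext("testdb",
            new ArrayTableDataProvider(tableDef, rows));

        // Code under test can query this context exactly as it would
        // query the production database's DataContext.
        DataSet ds = dc.query().from("person").select("name").execute();
        while (ds.next()) {
            System.out.println(ds.getRow().getValue(0));
        }
        ds.close();
    }
}
```

Because both the test double and the real source implement DataContext, the code under test needs no changes between environments.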

It seems Apache MetaModel does not support HDFS though it supports HBase. Any specific reason for that?

Not really, except it wasn’t requested by the community yet. But we do have interfaces that quite easily let you use e.g. the CsvDataContext on a resource (file) in HDFS. In fact a colleague of mine did that already for a separate project where a MetaModel-based application was applied to Hadoop.

My guess as to why this is less interesting is that if you’re an HDFS user then you typically have such an amount of data that you anyway don’t want to use a querying framework (such as MetaModel) but rather a processing framework (such as MapReduce) in order to handle it.

Does Apache MetaModel support polyglot operations with multiple data stores?

Yes, a common thing that our users ask is stuff like “I have a CSV file which contains keys that are also represented in my database – can I join them?” … And with MetaModel you absolutely can. We offer a class called CompositeDataContext which basically lets you query (and explore for that matter) multiple data stores as if they were one.
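A sketch of that scenario, joining a CSV file of keys with a second store (here a PojoDataContext stands in for a real database; all file, table and column names are invented):

```java
import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.metamodel.CompositeDataContext;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.pojo.ArrayTableDataProvider;
import org.apache.metamodel.pojo.PojoDataContext;
import org.apache.metamodel.schema.Table;
import org.apache.metamodel.util.SimpleTableDef;

public class CompositeExample {
    public static void main(String[] args) throws Exception {
        // A CSV file containing keys...
        File file = new File("keys.csv");
        FileWriter writer = new FileWriter(file);
        writer.write("id\n1\n2\n");
        writer.close();
        DataContext csv = new CsvDataContext(file);

        // ...and a second store with matching keys.
        Collection<Object[]> rows = new ArrayList<Object[]>();
        rows.add(new Object[] { "1", "Alice" });
        rows.add(new Object[] { "2", "Bob" });
        DataContext db = new PojoDataContext("db", new ArrayTableDataProvider(
            new SimpleTableDef("person", new String[] { "id", "name" }), rows));

        // One composite context over both stores; the join across them is
        // evaluated by MetaModel's own query engine.
        DataContext composite = new CompositeDataContext(csv, db);
        Table keys = composite.getTableByQualifiedLabel("keys.csv");
        Table person = composite.getTableByQualifiedLabel("person");

        DataSet ds = composite.query()
                .from(keys)
                .innerJoin(person)
                .on(keys.getColumnByName("id"), person.getColumnByName("id"))
                .select(person.getColumnByName("name"))
                .execute();
        while (ds.next()) {
            System.out.println(ds.getRow().getValue(0));
        }
        ds.close();
    }
}
```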

Kasper Sorensen is the VP of Apache MetaModel and in his daily life he works as Principal Tech Lead at Human Inference, a Neopost company. The main products in his portfolio are Apache MetaModel and the open source Data Quality toolkit DataCleaner.

