With the wide prevalence of multiple data stores including relational, NoSQL and unstructured formats, it becomes natural to look for a library which can act as a common exploration and querying connector. Apache MetaModel is one such library which aims to provide a common interface for discovery, metadata exploration and querying of different types of data sources. With MetaModel, the user can currently query CouchDB, MongoDB, HBase, Cassandra, MySQL, Oracle DB and SQL Server among others.
"MetaModel is a library that encapsulates the differences and enhances the capabilities of different data stores. Rich querying abilities are offered to data stores that do not otherwise support advanced querying and a unified view of the data store structure is offered through a single model of the schema, tables, columns and relationships."
HadoopSphere discussed with Kasper Sorensen, VP of Apache MetaModel on the functional fitment and features of Apache MetaModel. Here is what Kasper had to say about this interesting product.
Please help us understand the purpose of Apache MetaModel and which use cases can Apache MetaModel fit in?
MetaModel was designed to make connectivity across very different types of databases, data file formats and likewise possible with just one uniform approach. We wanted an API where you can explore and query a data source using exactly the same codebase regardless of the source being an Excel spreadsheet, a relational database or something completely different.
There are quite a lot of frameworks out there that do this. But they more or less have the requirement that you need to map your source to some domain model and then you end up actually querying the domain model, not the data source itself. So if your user is the guy that is interested in the Excel sheet or the database table itself then he cannot directly relate his data with what he is getting from the framework. This is why we used the name ‘MetaModel’ – we present data based on a metadata model of the source, not based on a domain model mapping approach.
How does Apache MetaModel work?
It is important for us that MetaModel have very little infrastructure requirements - so you don’t have to use any particular container type or dependency injection framework. MetaModel is plain java oriented and if you want to use it you just instantiate objects, call methods etc.
Whenever you want to interact with data in MetaModel you need an object that implements the DataContext interface. A DataContext is a bit like a “Connection” to a database. The DataContext exposes methods to explore the metadata (schemas, tables etc.) as well as to query the actual data. The Query API is preferably type-safe Java, linked to the metadata objects, but you can also express your query as SQL if you like.
Depending on the type of DataContext, we have a few different ways of working. For SQL databases we of course want to delegate more or less the whole query to the database. On other source types we can only delegate some parts of the query to the underlying database. For instance, if you apply a GROUP BY operation to a NoSQL database, then we usually have to do the grouping part ourselves. For that we have a pluggable query engine. Finally some source types, such as CSV files or XML documents, do not have a query engine already and we wrap our own query engine around them.
Some sources can also be updated. We offer a functional interface where you pass a UpdateScript object that does required update, when it is possible according to the underlying source and with the transactional features that it may or may not support.
Give us a snapshot of what is Apache MetaModel competing with - both in open source and commercial ecosystem?
I don’t think there’s anything out there that is really a LOT like MetaModel. But there are obviously some typical frameworks that has a hint of the same.
JPA, Hibernate and so on are similar in the way that they are essentially abstracting away the underlying storage technology. But they are very different in the sense that they are modelled around a domain model, not the data source itself.
LINQ (for .NET) has a lot of similarities with MetaModel. Obviously the platform is different though and the syntax of LINQ is superior to anything we can achieve as being “just” a library. On the plus-side for MetaModel, I believe we have one of the easiest interfaces to implement if you want to make your own adaptor.
What lies ahead on the roadmap for Apache MetaModel in 2015?
We are in a period of taking small steps so that we get a feel of what the community wants. For example we just made a release where we added write/update support for our ElasticSearch module.
So the long-term roadmap is not really set in stone. But we do always want to expand the portfolio of supported data stores. I personally also would like to see MetaModel used in a few other Apache projects so maybe we need to work outside of our own community, engaging with others as well.
Why would a user explore metadata using Apache MetaModel and not connect to various data stores directly?
If you only need to connect to one data store, which already has a query engine and all - then you don’t have to use MetaModel. A key strength in MetaModel is the uniformed access to multiple data stores and similarly a weakness is in utilizing all the functionality of a source. We do have great metrics overall, but if you’re chasing to optimize the use of just one source then you can typically receive better results by going directly to it.
Another reason might be to get a query API for things such as CSV files, Spreadsheets and so on, which normally have no query capabilities. MetaModel will provide you with a convenient shortcut there.
Also testability is a strong point of MetaModel. You may write code to interact with your database, but simply test it using a POJO data store (an in-memory Java collection structure) which is fast and light-weight.
It seems Apache MetaModel does not support HDFS though it supports HBase. Any specific reason for that?
Not really, except it wasn’t requested by the community yet. But we do have interfaces that quite easily let you use e.g. the CsvDataContext on a resource (file) in HDFS. In fact a colleague of mine did that already for a separate project where a MetaModel-based application was applied to Hadoop.
My guess to answer “why” this is less interesting is that if you’re a HDFS user then you typically have such an amount of data that you anyways don’t want to use a querying framework (as MetaModel) but rather want a processing framework (such as MapReduce or so) in order to handle it.
Does Apache MetaModel support polyglot operations with multiple data stores?
Yes, a common thing that our users ask is stuff like “I have a CSV file which contains keys that are also represented in my database – can I join them?” … And with MetaModel you absolutely can. We offer a class called CompositeDataContext which basically lets you query (and explore for that matter) multiple data stores as if they were one.