
Exploring varied data stores with Apache MetaModel

With the wide prevalence of multiple data stores including relational, NoSQL and unstructured formats, it becomes natural to look for a library which can act as a common exploration and querying connector. Apache MetaModel is one such library which aims to provide a common interface for discovery, metadata exploration and querying of different types of data sources. With MetaModel, the user can currently query CouchDB, MongoDB, HBase, Cassandra, MySQL, Oracle DB and SQL Server among others.

"MetaModel is a library that encapsulates the differences and enhances the capabilities of different data stores. Rich querying abilities are offered to data stores that do not otherwise support advanced querying and a unified view of the data store structure is offered through a single model of the schema, tables, columns and relationships."

HadoopSphere spoke with Kasper Sorensen, VP of Apache MetaModel, about the functional fit and features of Apache MetaModel. Here is what Kasper had to say about this interesting product.

Please help us understand the purpose of Apache MetaModel. Which use cases does it fit?

MetaModel was designed to make connectivity across very different types of databases, data file formats and the like possible with just one uniform approach. We wanted an API where you can explore and query a data source using exactly the same codebase regardless of the source being an Excel spreadsheet, a relational database or something completely different.

There are quite a lot of frameworks out there that do this. But they more or less have the requirement that you need to map your source to some domain model and then you end up actually querying the domain model, not the data source itself. So if your user is the guy that is interested in the Excel sheet or the database table itself then he cannot directly relate his data with what he is getting from the framework. This is why we used the name ‘MetaModel’ – we present data based on a metadata model of the source, not based on a domain model mapping approach.

How does Apache MetaModel work?

It is important for us that MetaModel has very few infrastructure requirements, so you don’t have to use any particular container type or dependency injection framework. MetaModel is plain-Java oriented: if you want to use it, you just instantiate objects, call methods etc.

Whenever you want to interact with data in MetaModel you need an object that implements the DataContext interface. A DataContext is a bit like a “Connection” to a database. The DataContext exposes methods to explore the metadata (schemas, tables etc.) as well as to query the actual data. The Query API is preferably type-safe Java, linked to the metadata objects, but you can also express your query as SQL if you like.

Depending on the type of DataContext, we have a few different ways of working. For SQL databases we of course want to delegate more or less the whole query to the database. On other source types we can only delegate some parts of the query to the underlying database. For instance, if you apply a GROUP BY operation to a NoSQL database, then we usually have to do the grouping part ourselves. For that we have a pluggable query engine. Finally some source types, such as CSV files or XML documents, do not have a query engine already and we wrap our own query engine around them.
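As a rough illustration of the "do the grouping part ourselves" case, the plain-Java sketch below aggregates rows on the client after they have been fetched from a store that cannot group. It is not MetaModel's query engine or API; the ClientSideGroupBy class, the Row type and the country/amount columns are hypothetical names chosen for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a hypothetical stand-in, not MetaModel's query engine or API.
final class ClientSideGroupBy {

    // One row as it might come back from a store that cannot group by itself.
    static final class Row {
        final String country;
        final int amount;

        Row(String country, int amount) {
            this.country = country;
            this.amount = amount;
        }
    }

    // Client-side equivalent of:
    //   SELECT country, SUM(amount) FROM rows GROUP BY country
    static Map<String, Integer> sumByCountry(List<Row> fetchedRows) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted keys for stable output
        for (Row row : fetchedRows) {
            totals.merge(row.country, row.amount, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("DK", 10), new Row("NL", 5), new Row("DK", 7));
        System.out.println(sumByCountry(rows)); // prints {DK=17, NL=5}
    }
}
```

The point of the sketch is the split of work: the store only returns rows, and the aggregation happens in the library layer on top of it.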

Some sources can also be updated. We offer a functional interface where you pass an UpdateScript object that performs the required update, when that is possible for the underlying source and with whatever transactional features it may or may not support.
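The callback style described here can be sketched in plain Java as follows. This is not MetaModel's UpdateScript API: InMemoryTable, executeUpdate and snapshot are hypothetical names, and the all-or-nothing behaviour is an assumption made for the sketch, since the interview says transactional support depends on the underlying source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustration only: hypothetical names, not MetaModel's UpdateScript API.
final class ScriptedUpdate {

    static final class InMemoryTable {
        private List<String> rows = new ArrayList<>();

        // Runs the script against a working copy and publishes the copy only if
        // the script returns normally; an exception leaves the table untouched.
        void executeUpdate(Consumer<List<String>> script) {
            List<String> workingCopy = new ArrayList<>(rows);
            script.accept(workingCopy);
            rows = workingCopy;
        }

        // Read-only copy of the current rows.
        List<String> snapshot() {
            return List.copyOf(rows);
        }
    }

    public static void main(String[] args) {
        InMemoryTable table = new InMemoryTable();
        table.executeUpdate(rows -> {
            rows.add("alice");
            rows.add("bob");
        });
        System.out.println(table.snapshot()); // prints [alice, bob]
    }
}
```

Passing the update as a script lets the store decide how to run it, for example inside a transaction where one is available.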

Give us a snapshot of what Apache MetaModel is competing with, in both the open source and commercial ecosystems.

I don’t think there’s anything out there that is really a LOT like MetaModel. But there are obviously some typical frameworks that have a hint of the same.

JPA, Hibernate and so on are similar in the way that they are essentially abstracting away the underlying storage technology. But they are very different in the sense that they are modelled around a domain model, not the data source itself.

LINQ (for .NET) has a lot of similarities with MetaModel. Obviously the platform is different though and the syntax of LINQ is superior to anything we can achieve as being “just” a library. On the plus-side for MetaModel, I believe we have one of the easiest interfaces to implement if you want to make your own adaptor.

What lies ahead on the roadmap for Apache MetaModel in 2015?

We are in a period of taking small steps so that we get a feel of what the community wants. For example we just made a release where we added write/update support for our ElasticSearch module.

So the long-term roadmap is not really set in stone. But we do always want to expand the portfolio of supported data stores. I personally also would like to see MetaModel used in a few other Apache projects so maybe we need to work outside of our own community, engaging with others as well.

Why would a user explore metadata using Apache MetaModel and not connect to various data stores directly?

If you only need to connect to one data store, which already has a query engine and all, then you don’t have to use MetaModel. A key strength of MetaModel is uniform access to multiple data stores; the matching weakness is that it cannot use every feature of each individual source. We do have great metrics overall, but if you’re looking to optimize the use of just one source then you can typically get better results by going directly to it.

Another reason might be to get a query API for things such as CSV files, Spreadsheets and so on, which normally have no query capabilities. MetaModel will provide you with a convenient shortcut there.

Also testability is a strong point of MetaModel. You may write code to interact with your database, but simply test it using a POJO data store (an in-memory Java collection structure) which is fast and light-weight.

It seems Apache MetaModel does not support HDFS though it supports HBase. Any specific reason for that?

Not really, except it wasn’t requested by the community yet. But we do have interfaces that quite easily let you use e.g. the CsvDataContext on a resource (file) in HDFS. In fact a colleague of mine did that already for a separate project where a MetaModel-based application was applied to Hadoop.

My guess as to why this is less interesting is that if you’re an HDFS user then you typically have so much data that you don’t want a querying framework (such as MetaModel) anyway, but rather a processing framework (such as MapReduce) to handle it.

Does Apache MetaModel support polyglot operations with multiple data stores?

Yes, a common thing that our users ask is stuff like “I have a CSV file which contains keys that are also represented in my database – can I join them?” … And with MetaModel you absolutely can. We offer a class called CompositeDataContext which basically lets you query (and explore for that matter) multiple data stores as if they were one.
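To make the CSV-plus-database question concrete, here is a minimal plain-Java sketch of joining rows from two sources on a shared key. It does not use CompositeDataContext or any MetaModel class; CrossSourceJoin, the column layout and the sample values are hypothetical and only show the kind of result such a cross-store join produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: NOT how CompositeDataContext is implemented.
final class CrossSourceJoin {

    // Inner join of CSV rows (key in column 0) with database names keyed by the same id.
    static List<String> join(List<String[]> csvRows, Map<String, String> dbNamesById) {
        List<String> joined = new ArrayList<>();
        for (String[] csvRow : csvRows) {
            String name = dbNamesById.get(csvRow[0]);
            if (name != null) { // inner join: skip keys missing from the database
                joined.add(csvRow[0] + "," + csvRow[1] + "," + name);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> csvRows = List.of(
                new String[] {"1", "2015-01-02"},
                new String[] {"3", "2015-01-05"});
        Map<String, String> dbNamesById = Map.of("1", "Alice", "2", "Bob");
        System.out.println(join(csvRows, dbNamesById)); // prints [1,2015-01-02,Alice]
    }
}
```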

Kasper Sorensen is the VP of Apache MetaModel and in his daily life he works as Principal Tech Lead at Human Inference, a Neopost company. The main products in his portfolio are Apache MetaModel and the open source Data Quality toolkit DataCleaner.

