
Amazon DynamoDB datastore for Gora

What was initially suggested during casual conversation at ApacheCon 2011 in November 2011 as a “neat idea” would soon become prime ground for Gora's first taste of participation within Google's Summer of Code program. Initially, the project, titled Amazon DynamoDB datastore for Gora, merely aimed to extend the Gora framework to Amazon DynamoDB. However, it soon became obvious that the issue would involve much more than that simple vision.


The Gora 0.3 Toolbox

We briefly digress to discuss some other notable additions to Gora in 0.3, namely:
  • Modification of the Query interface: The Query interface was amended from Query<K, T> to Query<K, T extends Persistent> to be more precise and explicit for developers. Consequently, all implementors and users of the Query interface can only pass objects of Persistent type (a simplified sketch follows this list).
  • Logging improvements for data store mappings: A key aspect of using Gora well is the establishment and accurate definition of a suitable model for data mapping. All data stores in Gora read mappings from XML mapping descriptors. We therefore improved logging of tables, keyspaces, attributes, values, etc. across the data store modules.
  • Implementation of Bytes Type for Cassandra column family validator: Although there were a number of fixes specific to the gora-cassandra module, one particular improvement was the implementation of BytesType for Cassandra column family validators. This allows us to support all six Avro complex data types, namely Records, Enums, Arrays, Maps, Unions, and Fixed.
  • Improvement of thread-safety in Cassandra Store: This related to a wrong assumption about the thread safety of a LinkedHashMap when iterating over keys in insertion order. It is not enough simply to iterate over a buffer (which is a LinkedHashMap) when executing operations; we therefore needed to ensure, via the implementation, that the user manually synchronizes on the returned map when iterating over any of its collection views.
  • Synchronizing on certain Cassandra mutations: This issue occurred when inserting columns, inserting sub-columns and deleting sub-columns (using Hector Client's Mutator) in highly concurrent environments. Imagine operating with 100-400k URLs during a fetch of the web, where the fetcher is running with 30 threads and 2 threads per queue. Executing operations in such concurrent environments necessitates synchronization.
  • Finally, the improvements to the GoraCompiler: The Gora compiler is used to compile Avro schemas down into Persistent classes. The improvements included support for multiple Avro schemas and functionality which allows users to optionally add license headers to their generated code.
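
To illustrate the Query signature change mentioned in the first bullet above, here is a simplified sketch (only the type bound is the point; the remaining methods of the real interface are elided, so do not treat this as the actual Gora source):

  import org.apache.gora.persistency.Persistent;

  // Simplified sketch of the tightened contract: the value type parameter is now
  // bounded by Persistent, so only Gora data beans can flow through a Query.
  public interface Query<K, T extends Persistent> {
    void setStartKey(K startKey);
    void setEndKey(K endKey);
    // ... remaining methods (fields, limit, execution) unchanged
  }
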
Hopefully the above provides a taste of the kind of work that went into the 0.3 development drive; however, our CHANGES.txt file for the release should be consulted for a full breakdown of the development effort.
Returning to the most significant contribution of the 0.3 release, the remainder of this section discusses the gora-dynamodb module.

Amazon DynamoDB datastore for Gora
Amazon DynamoDB is a fast, highly scalable, highly available, cost-effective, non-relational database service which removes traditional scalability limitations on data storage while maintaining low latency and predictable performance. It was envisaged that introduction of the DynamoDB module would allow users to utilize the expressiveness offered from the Gora framework in conjunction with the DynamoDB model.
Before we progress to cover the issue more comprehensively, along with the technical challenges it posed (please see the next section), it is important to understand some of the defining characteristics (and founding motivations) that Gora inherited from its pre-0.3 design.
  • Gora was originally and specifically designed with NoSQL data stores in mind. For example, the API is based on <key, value> pairs, rather than just beans. The original architects also believed that the object mapping layer should be tuned for batch operations (like first-class object re-use support).
  • Gora uses Avro to generate data beans from Avro schemas. Moreover, most of the serialization is delegated to Avro. For example, a map is serialized to a field (if not configured otherwise) using Avro serialization.
  • Gora provides first-class support for Hadoop MapReduce. DataStore implementations are responsible for partitioning the data (which is then converted to Hadoop splits), and all the locality information is again obtained from the data store. Developing MapReduce jobs with Gora is really easy, as the sketch below shows.
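
As a rough sketch of that last point (this is not code from the post: the Pageview bean comes from the Gora Log Manager tutorial mentioned at the end, and the DataStoreFactory and GoraMapper.initMapperJob calls should be verified against the Gora release in use), a Gora query can be wired straight into a Hadoop job:

  import java.io.IOException;

  import org.apache.gora.mapreduce.GoraMapper;
  import org.apache.gora.query.Query;
  import org.apache.gora.store.DataStore;
  import org.apache.gora.store.DataStoreFactory;
  import org.apache.gora.tutorial.log.generated.Pageview;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class PageviewCountJob {

    // Emits (url, 1) for every Pageview bean handed to the mapper by Gora's splits.
    public static class PageviewCountMapper
        extends GoraMapper<Long, Pageview, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1L);

      @Override
      protected void map(Long key, Pageview pageview, Context context)
          throws IOException, InterruptedException {
        context.write(new Text(pageview.getUrl().toString()), ONE);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // The concrete backend (HBase, Cassandra, ...) is chosen via gora.properties.
      DataStore<Long, Pageview> store =
          DataStoreFactory.getDataStore(Long.class, Pageview.class, conf);

      Job job = new Job(conf, "pageview-count");
      job.setJarByClass(PageviewCountJob.class);

      // The data store partitions the query and Gora turns the partitions into splits.
      Query<Long, Pageview> query = store.newQuery();
      GoraMapper.initMapperJob(job, query, store,
          Text.class, LongWritable.class, PageviewCountMapper.class, true);

      job.setNumReduceTasks(0);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      TextOutputFormat.setOutputPath(job, new Path(args[0]));

      job.waitForCompletion(true);
      store.close();
    }
  }
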
This is all great, but what happens when we consider other data models (and consequently different products/data stores) which have no relation or logical affiliation with Avro, or which do not fit some of the other decisions stated above? To add to this, we still want to be able to use the same data structures to persist objects to such data stores. Oh, and by the way, we also want to be able to use Pig or Cascading within Gora jobs to mine the data stored within those data stores.
Read on to see how we essentially restructured Gora's core module, the main change being the separation of the Avro-specific persistence layer from a new abstract persistence layer. This effectively enables users to extend the Gora model to use DynamoDB, Google's AppEngine, Microsoft Azure, and other web services which Gora might include in the future for persistent storage and/or analysis.

Technical Challenges and Detail

The Gora API is based on <key, value> pairs, rather than just Java beans. The original architects also believed that the object mapping layer should be tuned for batch operations (like first class object re-use support).
As we mentioned above, Gora uses Avro to generate data beans from Avro schemas (avsc files). Moreover, traditionally most of the serialization across Gora modules is delegated to Avro due to its performance advantages and recognition within the data serialization space.
The justification behind this architectural decision was to provide a simple but direct way of mapping user input schema(s) into objects which can be persisted via Gora's API.
For example, a user can simply define the object to store by creating a file containing the required schema in Avro's JSON notation:

  {
    "type": "record",
    "name": "User",
    "namespace": "org.apache.gora.examples.generated",
    "fields" : [
      {"name": "firstname", "type": "string"},
      {"name": "lastname", "type": "string"},
      {"name": "password", "type": "string"}
    ]
  }

More complex data types can also be serialized within supported data stores; for example, a Map is serialized to a field (if not configured otherwise) using Avro.
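As a brief sketch of how such a generated bean is then persisted through Gora's <key, value> API (the User class is assumed to have been generated from the schema above by the Gora compiler, the Utf8 string type reflects the Avro-generated beans of this era, and the exact DataStoreFactory call should be checked against the release in use):

  import org.apache.avro.util.Utf8;
  import org.apache.gora.examples.generated.User;
  import org.apache.gora.store.DataStore;
  import org.apache.gora.store.DataStoreFactory;
  import org.apache.hadoop.conf.Configuration;

  public class UserExample {
    public static void main(String[] args) throws Exception {
      // The concrete DataStore implementation (HBase, Cassandra, DynamoDB, ...)
      // is resolved from gora.properties; this calling code never changes.
      DataStore<String, User> store =
          DataStoreFactory.getDataStore(String.class, User.class, new Configuration());

      User user = new User();
      user.setFirstname(new Utf8("Ada"));
      user.setLastname(new Utf8("Lovelace"));
      user.setPassword(new Utf8("lovelace-secret"));

      store.put("ada", user);   // persist the bean under the chosen key
      store.flush();            // make sure buffered operations reach the store
      User fetched = store.get("ada");
      System.out.println(fetched.getFirstname());
      store.close();
    }
  }
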
One noticeable problem with this architecture and representation though is Gora's direct dependence on disk-based serialization using Avro. When we embarked upon writing the gora-dynamodb module, there were several incompatibilities attributable to the initial Avro architecture. It was therefore essential to create an extra layer for persisting data into non-disk-backed systems such as web-service-backed data stores.
This was overcome by creating a clean abstraction for Gora's persistence layer. The objective of using this Persistent type is to allow developers to re-use previously generated data beans across any of the supported data stores, easily enabling them to extend their existing Gora systems to utilize the web-services API as well.
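
To give a feel for the shape of that abstraction, the fragment below is purely illustrative (the type names are ours, not the actual Gora classes): data beans depend only on a thin Persistent contract, while Avro-backed and web-service-backed base classes supply the store-specific plumbing.

  // Illustrative only: simplified stand-ins, not the real Gora type names.
  public interface Persistent {
    String[] getFields();        // fields defined by the bean's schema
    boolean isDirty(int field);  // change tracking used for partial updates
    void clearDirty();
  }

  // Avro-backed beans (HBase, Cassandra, ...) keep their Avro serialization here.
  abstract class AvroPersistentBase implements Persistent { /* ... */ }

  // Web-service-backed beans (e.g. DynamoDB) carry provider annotations instead,
  // as produced by the module's own data bean compiler.
  abstract class WebServicePersistentBase implements Persistent { /* ... */ }
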
Using web-service modules should be as easy as defining the schema to be persisted and setting the necessary credentials in the default gora.properties file. Right now, the new gora-dynamodb module is accompanied by its own data bean compiler (modeled on Avro's SpecificCompiler, but implementing the annotations and special requirements used within the Amazon SDK and the Amazon platform itself) to produce its data beans. In the future we expect different (new) data stores to follow this trend, each having its own compiler which can be run from a main compiler script with different options. Hopefully this has given a flavor of how existing Gora deployments can be extended to persist their data into the cloud, and also how new developers can extend this model to suit dynamic requirements for their big data. In the next section we close with a brief discussion of what the future holds for Gora through 2013 and beyond.
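
For illustration only (the gora.datastore.default key is Gora's standard default-store property, but the DynamoDB-specific key and values below are placeholders and should be checked against the gora-dynamodb documentation), a gora.properties file for the DynamoDB backend might look roughly like this:

  # Select the DynamoDB backend as the default data store.
  gora.datastore.default=org.apache.gora.dynamodb.store.DynamoDBStore

  # Credentials used to sign requests against the DynamoDB web service
  # (property key and file name are illustrative placeholders).
  gora.dynamodb.credentials=AwsCredentials.properties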

Gora >= 0.4 Roadmap


Over the last few years Gora has grown in popularity and will be participating in numerous Google Summer of Code (GSoC) projects this year, building on the success we enjoyed in last year's program. The first project planned for this year aims to use Cascading as an alternative data processing paradigm to MapReduce. Cascading is a Java application framework that enables developers to quickly and easily build rich data analytics and data management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API-compatible distributions. In this respect we are really looking forward to enriching Gora's functionality to not only write, read and persist data, but also to process huge amounts of data on the fly.
The second GSoC project proposes to add another data store: Oracle NoSQL. The Oracle NoSQL database is a distributed key-value database. It is designed to provide highly reliable, scalable and available data storage across a configurable set of systems that function as storage nodes. As one might have guessed, this fits in perfectly with our vision to make Gora the number one persistence layer for big data. Additionally, it will help Gora keep the community momentum going by supporting another popular data store in the marketplace. Finally, Gora is featuring an integration with Apache Giraph, an iterative graph processing system built for high scalability. The project aims to provide Giraph with different mechanisms to persist data into, and retrieve data from, Gora datastores.

And Finally...

Want to know more about Gora? Easy, just head over to our downloads page and check out the most recent 0.3 stable release.
Want to know more about Gora, including the most up-to-date news from the community? It's all linked from our main site. You may want to get started with our Log Manager tutorial!



About the Authors:

Renato Marroquin holds a Master's degree in Computer Science from the Pontifical University of Rio de Janeiro, with a thesis titled "Experimental Statistical Analysis of MapReduce Jobs". He is currently a Computer Science professor at Universidad Catolica San Pablo in Arequipa, Peru, an Apache Gora PMC member and committer, and an open source and big data enthusiast.

Lewis holds a PhD in Legislative Informatics from Glasgow Caledonian University in Glasgow, Scotland. He is currently a Post Doctoral Research Scholar at Stanford University, CA, a member of the Apache Software Foundation, and a committer on several Apache projects including Gora, where he is VP.


