What was initially suggested during a casual conversation at ApacheCon in November 2011 as a "neat idea" would soon become prime ground for Gora's first taste of participation in Google's Summer of Code program. Initially, the project, titled Amazon DynamoDB datastore for Gora, merely aimed to extend the Gora framework to Amazon DynamoDB. However, it soon became obvious that the issue would encompass much more than that simple vision.
The Gora 0.3 Toolbox
We briefly digress to discuss some other notable additions to Gora in 0.3, namely:
- Modification of the Query interface: The Query interface was amended from Query&lt;K, T&gt; to Query&lt;K, T extends Persistent&gt; to be more precise and explicit for developers. Consequently, all implementors and users of the Query interface may only pass objects of Persistent type.
- Logging improvements for data store mappings: A key aspect of using Gora well is the establishment and accurate definition of a suitable model for data mapping. All data stores in Gora read mappings from XML mapping descriptors. We therefore improved logging support for tables, keyspaces, attributes, values, and so on across the data store modules.
- Implementation of Bytes Type for the Cassandra column family validator: Although there were a number of fixes specific to the gora-cassandra module, one particular improvement was the implementation of BytesType for Cassandra column family validators. This allows us to support all six Avro complex data types: Records, Enums, Arrays, Maps, Unions, and Fixed.
- Improvement of thread-safety in the Cassandra store: This related to a wrong assumption about the thread safety of a LinkedHashMap when iterating over keys in insertion order. It is not enough to simply iterate over a buffer (which is a LinkedHashMap) when executing operations; we therefore needed to ensure (via implementation) that the user manually synchronizes on the returned map when iterating over any of its collection views, as shown in the sketch after this list.
- Synchronizing on certain Cassandra mutations: This issue occurred when inserting columns and sub-columns and deleting sub-columns (using the Hector client's Mutator) in highly concurrent environments. Imagine operating with 100-400k URLs during a fetch of the web, where the fetcher is running with 30 threads and 2 threads per queue. Executing operations in such highly concurrent environments necessitates synchronization.
- Finally, improvements to the GoraCompiler: The Gora compiler is used to compile Avro schemas down into Persistent classes. The improvements included support for multiple Avro schemas and functionality which allows users to optionally add license headers to their generated code.
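As a brief illustration of the thread-safety fix mentioned above, the sketch below shows the documented java.util contract that the fix relies on: wrap the insertion-ordered buffer with Collections.synchronizedMap, and hold the map's monitor while iterating over any of its collection views. The class and method names here are illustrative, not Gora's actual internals.

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class SynchronizedBufferSketch {

  // An insertion-ordered buffer of pending mutations; synchronizedMap makes
  // individual put/get/remove calls thread-safe.
  private final Map<String, byte[]> buffer =
      Collections.synchronizedMap(new LinkedHashMap<String, byte[]>());

  public void add(String key, byte[] mutation) {
    buffer.put(key, mutation); // single operations are guarded automatically
  }

  public void flush() {
    // Iteration over a collection view is NOT guarded automatically; per the
    // Collections.synchronizedMap contract, the caller must synchronize on
    // the map itself for the duration of the iteration.
    synchronized (buffer) {
      for (Map.Entry<String, byte[]> entry : buffer.entrySet()) {
        execute(entry.getKey(), entry.getValue());
      }
      buffer.clear();
    }
  }

  private void execute(String key, byte[] mutation) {
    // placeholder for the real (e.g. Hector-based) mutation logic
  }
}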
Hopefully the above provides a taste of the kind of work that went into the 0.3 development drive; however, our CHANGES.txt file for the release should be consulted for a full breakdown of the development effort.
Returning to the most significant contribution of the 0.3 release, the remainder of this section discusses the gora-dynamodb module.
Amazon DynamoDB datastore for Gora
Amazon DynamoDB is a fast, highly scalable, highly available, cost-effective, non-relational database service which removes traditional scalability limitations on data storage while maintaining low latency and predictable performance. It was envisaged that the introduction of the DynamoDB module would allow users to utilize the expressiveness offered by the Gora framework in conjunction with the DynamoDB model.
Before we progress to cover the issue more comprehensively, along with the technical challenges it posed (please see the next section), it is important to understand, in essence, some defining characteristics (and founding motivations) of Gora prior to 0.3.
- Gora was originally and specifically designed with NoSQL data stores in mind. For example, the API is based on &lt;key, value&gt; pairs rather than just beans. The original architects also believed that the object mapping layer should be tuned for batch operations (such as first-class object re-use support).
- Gora uses Avro to generate data beans from Avro schemas. Moreover, most of the serialization is delegated to Avro. For example, a map is serialized to a field (if not configured otherwise) using Avro serialization.
- Gora provides first-class support for Hadoop MapReduce. DataStore implementations are responsible for partitioning the data (which is then converted to Hadoop splits), and all the locality information is again obtained from the data store. Developing MapReduce jobs with Gora is really easy.
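To make that last point concrete, here is a minimal, hedged sketch of a map-only Gora job. The helper calls follow the Gora 0.2/0.3 MapReduce API, the User bean is assumed to be generated from an Avro schema (one such schema appears later in this post), and the output path is purely illustrative.

import java.io.IOException;

import org.apache.gora.examples.generated.User;
import org.apache.gora.mapreduce.GoraMapper;
import org.apache.gora.query.Query;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CountUsersSketch {

  // The mapper receives <key, bean> pairs streamed directly from the data store.
  public static class UserMapper extends GoraMapper<String, User, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(String key, User user, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(user.getLastname().toString()), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The concrete backend (HBase, Cassandra, ...) is chosen via gora.properties.
    DataStore<String, User> store =
        DataStoreFactory.getDataStore(String.class, User.class, conf);

    Job job = new Job(conf, "count-users");
    Query<String, User> query = store.newQuery();

    // Gora partitions the query into Hadoop splits and supplies locality info.
    GoraMapper.initMapperJob(job, query, store,
        Text.class, LongWritable.class, UserMapper.class, true);

    job.setNumReduceTasks(0); // map-only, for brevity
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("user-counts"));

    job.waitForCompletion(true);
    store.close();
  }
}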
This is all great, but what happens when we consider other data models (and consequently different products/data stores) which have no relation or logical affiliation to Avro, or which do not fit some of the other decisions stated above? To add to this, we still wanted to be able to use the same data structures to persist objects to such data stores. Oh, and by the way, we also want to be able to use Pig or Cascading within Gora jobs to mine the data stored within the data stores.
Read on to see how we essentially restructured Gora's core module, the main change being the separation of Avro as a persistence layer from an actual abstract persistence layer. This effectively now enables users to extend the Gora model to use DynamoDB, Google's AppEngine, Microsoft Azure, and other web services which Gora might include in the future for persistent storage and/or analysis.
Technical Challenges and Detail
The Gora API is based on &lt;key, value&gt; pairs, rather than just Java beans. The original architects also believed that the object mapping layer should be tuned for batch operations (such as first-class object re-use support).
As we mentioned above, Gora uses Avro to generate data beans from Avro schemas (avsc files). Moreover, traditionally most of the serialization across Gora modules is delegated to Avro due to its performance advantages and recognition within the data serialization space.
The justification behind this architectural decision was to provide a simple but direct way of mapping user input schema(s) into objects which can be persisted via Gora's API.
For example, a user can simply define the object to store by creating a file containing the required schema using Avro's JSON notation:
{
  "type": "record",
  "name": "User",
  "namespace": "org.apache.gora.examples.generated",
  "fields": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"},
    {"name": "password", "type": "string"}
  ]
}
More complex data types can also be serialized within supported data stores, e.g. a Map is serialized to a field (if not configured otherwise) using Avro.
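Once a schema like the one above has been compiled into a data bean (for example with the GoraCompiler), persisting and retrieving instances might look like the following minimal sketch. It uses Gora's DataStoreFactory API with the backing store chosen by configuration; the exact getter/setter shapes of generated beans (e.g. whether strings are Utf8) vary by Gora/Avro version.

import org.apache.avro.util.Utf8;
import org.apache.gora.examples.generated.User;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;

public class UserStoreSketch {
  public static void main(String[] args) throws Exception {
    // The concrete DataStore implementation is selected via gora.properties.
    DataStore<String, User> store =
        DataStoreFactory.getDataStore(String.class, User.class, new Configuration());

    User user = new User();
    user.setFirstname(new Utf8("Ada"));      // Avro-era beans use Utf8 for strings
    user.setLastname(new Utf8("Lovelace"));
    user.setPassword(new Utf8("secret"));

    store.put("ada", user); // the <key, value> API: keyed persistence
    store.flush();

    User fetched = store.get("ada");
    System.out.println(fetched.getFirstname());

    store.close();
  }
}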
One noticeable problem with this architecture and representation, though, is Gora's direct dependence on disk-based serialization using Avro. When we embarked upon writing the gora-dynamodb module, there were several incompatibilities attributable to the initial Avro architecture. It was therefore essential to create an extra layer for persisting data into non-disk-backed systems such as web-service-backed data stores.
This was overcome by creating a clean abstraction for Gora's persistence layer. The objective of using this Persistent type is to allow developers to re-use previously generated data beans across any of the supported data stores, easily enabling them to extend their existing Gora systems to utilize the web-services API as well.
Using web-service modules should be as easy as defining the schema to be persisted and setting the necessary credentials in the default gora.properties file; a sketch of what such a configuration might look like follows.
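This is purely illustrative: the DynamoDBStore class name follows Gora's usual naming convention, and the credential keys below are hypothetical placeholders rather than the module's documented property names.

# gora.properties -- minimal sketch for a web-service-backed store
gora.datastore.default=org.apache.gora.dynamodb.store.DynamoDBStore
# hypothetical credential keys; consult the gora-dynamodb docs for the real ones
gora.dynamodb.aws.accessKey=YOUR_ACCESS_KEY
gora.dynamodb.aws.secretKey=YOUR_SECRET_KEY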
Right now, the new gora-dynamodb module is accompanied by its own data compiler (modeled on Avro's SpecificCompiler, but implementing annotations and special requirements as used within the Amazon SDK and the Amazon platform itself) to produce its data beans. In the future we expect different (new) data stores to follow this trend, each having their own compiler which can be used by running a main compiler script with different options. A hedged sketch of roughly the shape of bean such a compiler produces appears below.
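The annotations in this sketch are real Amazon SDK for Java persistence annotations, but the bean itself is illustrative; the actual output of the gora-dynamodb compiler may differ in shape and detail.

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

// A DynamoDB-oriented data bean: persistence metadata is carried by Amazon SDK
// annotations instead of an Avro schema.
@DynamoDBTable(tableName = "User")
public class User {

  private String id;
  private String firstname;

  @DynamoDBHashKey(attributeName = "id")
  public String getId() { return id; }
  public void setId(String id) { this.id = id; }

  @DynamoDBAttribute(attributeName = "firstname")
  public String getFirstname() { return firstname; }
  public void setFirstname(String firstname) { this.firstname = firstname; }
}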
Hopefully this has given a flavor of how existing Gora implementations can be extended to persist their data into the cloud, and of how new developers can extend this model to suit dynamic requirements for their big data. In the next section we close with a brief discussion of what the future holds for Gora through 2013 and beyond.
Gora >= 0.4 Roadmap
Over the last few years Gora has grown in popularity, and this year it will participate in numerous Google Summer of Code (GSoC) projects, building on the success we enjoyed at last year's program. The first project planned for this year aims to use Cascading as an alternative data processing paradigm to MapReduce. Cascading is a Java application framework that enables developers to quickly and easily develop rich data analytics and data management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API-compatible distributions. In this respect we are really looking forward to enriching Gora's functionality to not only write, read, and persist data, but also to process huge amounts of data on the fly.
The second GSoC project proposes to add another data store: Oracle NoSQL. The Oracle NoSQL database is a distributed key-value database. It is designed to provide highly reliable, scalable, and available data storage across a configurable set of systems that function as storage nodes. As one will have guessed, this fits in perfectly with our vision to make Gora the number one persistence layer for big data. Additionally, it will help keep the community momentum going by supporting another popular data store in the marketplace. Finally, Gora is featuring an integration with Apache Giraph, an iterative graph processing system built for high scalability. The project aims to provide Giraph with different mechanisms to persist data to, and retrieve data from, Gora datastores.
And Finally...
Want to know more about Gora? Easy: just head over to our downloads page and check out the most recent 0.3 stable release.
Want to know more about Gora, including the most up-to-date news within the community? It's all linked from our main site. You may want to get started with our Log Manager tutorial!
About the Authors:
Renato Marroquín holds a Master's degree in Computer Science from the Pontifical Catholic University of Rio de Janeiro, with a thesis titled "Experimental Statistical Analysis of MapReduce Jobs". He is currently a Computer Science professor at Universidad Católica San Pablo.
Lewis holds a PhD in Legislative Informatics.
Please feel free to post any comments on this post here, or alternatively on dev@gora.apache.org.