In-memory data model with Apache Gora

Open source in-memory data model and persistence for big data framework Apache Gora™ version 0.3, was released in May 2013. The 0.3 release offers significant improvements and changes to a number of modules including a number of bug fixes. However, what may be of significant interest to the DynamoDB community will be the addition of a gora-dynamodb datastore for mapping and persisting objects to Amazon's DynamoDB. Additionally the release includes various improvements to the gora-core and gora-cassandra modules as well as a new Web Services API implementation which enables users to extend Gora to any cloud storage platform of their choice.
This 2-part post provides commentary on all of the above and a whole lot more, expanding to cover where Gora fits in within the NoSQL and Big Data space, the development challenges and features which have been baked into Gora 0.3 and finally what we have on the road map for the 0.4 development drive.

Introducing Apache Gora
Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differ profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases, where one needs to use the full power of the data models in column stores for example. Gora fills this gap by providing an easy-to-use in-memory data model and persistence for big data framework with data store specific mappings and built in Apache Hadoop™ support.
The overall goal for Gora is to become the standard data representation and persistence framework for big data. From a development point of view, this challenge however comes in many shapes and sizes and in many flavors, some of which we discuss shortly. For the time being however lets consider the following  (founding) goals for Gora as a top level project within the Apache Software Foundation. Gora aims to provide:
·         Data Persistence : Persisting objects to Column stores such as Apache HBase™, Apache Cassandra™, Hypertable; key-value stores such as Voldemort, Redis, etc; SQL databases, such as MySQL, HSQLDB, flat files in local file system of Hadoop HDFS;
·      Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location;
·      Indexing : Persisting objects to Apache Lucene and Apache Solr indexes, accessing/querying the data with Gora API;
·      Analysis : Accesing the data and making analysis through adapters for Apache Pig, Apache Hive and Cascading;
·      MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop™) support for data in the data store.
In terms of where the Gora community is relative to the above, there are currently data store implementations for the distributed key/value store Apache Accumulo, data serialization system Apache Avro, column family data store Cassandra, distributed big data store HBase, the HDFS and now Amazon's DynamoDB.

One underlying formality which has come apparent when attempting to provide a common persistence layer for all of the above data stores (and more), is that the NoSQL and Big Data communities are moving extremely quickly. As expected, communities and consequently code bases also move in myriad of directions meaning that an overwhelming practical hurdle involves keeping up with the rush. The NoSQL space in particular is young and has such experienced astronomical growth in recent years with the best open source products being the ones which have ridden the wave. It is however important to consider that the NoSQL storage abstraction space is even younger, consequently Gora (being the leading open source ASLv2 licensed NoSQL storage abstraction product) is evolving in parallel. By nature the Gora community is diverse in its entirety and the product reflects this entirely. In the second part of this post we provide somewhat of a deep dive covering both the goodies and the technical challenges we encountered during the development of Gora 0.3.

2nd part of the post:
AmazonDynamoDB datastore for Gora
(to be published on Wednesday, 19 June 2013)


About the Authors:

Renato Marroquin is a Computer Science Master by the Pontifical University of Rio de Janeiro with the thesis titled "Experimental Statistical Analysis of MapReduce Jobs". He is currently a Computer Science Professor at Universidad Catolica San Pablo in Arequipa, Peru and also an Apache Gora PMC Member and Committer, Open Source and Big Data Enthusiast.

Lewis John McGibbney holds his PhD in Legislative Informatics from Glasgow Caledonian University in Glasgow Scotland. He is currently a Post Doctoral Research Scholar at Stanford University, CA ,a member of the Apache Software Foundation and Apache Committer at several Apache projects including Gora where he is VP.
Read more »

Beyond NSA, the intelligence community has a big technology footprint

While all through the past few days the focus has been on NSA activities, the discussion has often veered around the technologies and products used by NSA. At the same time, a side discussion topic has been the larger technical ecosystem of intelligence units. CIA has been one of the more prolific users of Information Technology by its own admission. To that extent, CIA spinned off a venture capital firm In-Q-Tel in 1999 to invest in focused sector companies. Per Helen Coster of Fortune Magazine, In-Q-Tel (IQT) has been named “after the gadget-toting James Bond character Q”.

In-Q-Tel states on its website that “We design our strategic investments to accelerate product development and delivery for this ready-soon innovation, and specifically to help companies add capabilities needed by our customers in the Intelligence Community”. To that effect, it has made over 200 investments in early stage companies for propping up products. Being a not-for-profit group, unlike Private Venture capitalists, the ROI is not the primary motive. Beyond funds, the startups have benefited from a government organization association. “It’s an ego boost to get a phone call from In-Q-Tel, but more importantly, it’s a direct path to major government customers.” states Venture Beat’s Christina Farr.

The investments usually are not more than 2 million $ and attract other private venture capital firms along. According to an analysis on Linksviewer, “In-Q-Tel has very few occasions where it has more than one common investment with another investor, and zero occasions where it has more than two investments in common.” Among the marquee investments in the past, both Google and Google Earth have received investments from IQT.

From its vast portfolio mentioned on IQT's web site, we list below the companies which have a cloud, storage, search, analytics and/or Big Data footprint. 

·  10gen:  the company behind MongoDB, a leading NOSQL database.
·  Adaptive Computing: offers a portfolio of Moab cloud management and Moab HPC workload management products for HPC, data centers and cloud.
·  Adapx: helps to collect data with paper forms, GIS maps, and/or all-weather field notebooks using digital pens; software integrated with Excel, Sharepoint and ArcGIS among others.
·  Apigee: provides API technology and services for enterprises and developers
·  Basis Technology: provides software solutions for text analytics, natural language processing, information retrieval, and name resolution in over twenty languages
·  Cleversafe: provides resilient storage solutions, ideally suited for storage clouds and massive digital archives
·  Cloudera: leading provider of Hadoop distribution and services.
·  Cloudant: provides “data layer as a service” for loading, storing, analyzing, and distributing application data
·  Delphix: provides solutions for provisioning, managing refreshing, and recovering databases for business-critical applications
·  Digital Reasoning: provides tools people need to understand relationships between entities in vast amounts of unstructured, semi-structured and structured data.
·  FMSAdvanced Systems Group: focuses on visualization and analytical solutions, as well as solutions in the area of Geospatial/Temporal Analysis and Situational Awareness
·  Huddle: provides enterprise content software with intelligent technology for recommending valuable information to users, with no need for search.
·  LucidWorks: provides search solution development platforms built on the power of Apache Solr/Lucene open source search via enterprise-grade subscriptions.
·  Narrative Insight: provides automated business analytics and natural language communication technology that transform data into personalized stories.
·  Novo Dynamics: develops intelligent information capture software and provides advanced analytics solutions that transform data into actionable insights
·  NetBase: provides semantic technology tool that reads sentences to surface insights from public and private online information sources.
·  piXlogic: has created software that automatically analyzes and searches images and video based on their visual content
·  Palantir Technologies: “was developed to address the most complex information analysis and security challenges faced by the U.S. intelligence, military, and law enforcement communities”. Provides extensible software solution designed from the ground up for data integration and analysis.
·  Pure Storage: the all-flash enterprise storage company, enables the broad deployment of flash in the data center
·  Power Assure: developer of data center infrastructure and energy management software for large enterprises, government agencies, and managed service providers
·  Platfora: works with existing Hadoop clusters and provide business intelligence software for Big Data.
·  Quantum 4D: integrates large data sets into sophisticated models producing moving visual representations that enable users to identify trends, relationships, and anomalies in real-time data
·  ReversingLabs: provides tools for analysis of all unknown binary content which may involve removing of all protection and obfuscation artifacts, unwrapping formatting elements and extracting relevant meta-data.  Results are compared against analysis reports on billions of goodware and malware files.
·  Recorded Future: provides software which utilizes linguistics and statistics algorithms to extract time-related information, and through temporal reasoning, help users understand relationships between entities and events over time, to form the world’s “first temporal analytics engine”.
·  Signal Innovations Group, Inc. (SIG): provides customers with an advanced platform for video analysis, including motion tracking, enhanced metadata creation, and motion-based anomalous behavior detection
·  SitScape: provides software to assemble situational business dashboards based on on-demand componentization and user interface virtualization of disparate enterprise-wide live Web applications and digital assets.
·  StreamBase Systems: provides software for high-performance Complex Event Processing (CEP) that analyze and act on real-time streaming data for instantaneous decision-making
·  Traction software: provides enterprise hypertext collaboration software which combines wiki-style group editing and the simplicity of a blog to provide business and government organizations with enterprise software
·  Visible Technologies: provides enterprise ready social media solution, offering a combination of software and services to harness business value from social communities.
·  Weather Analytics: delivers global climate intelligence by providing statistically stable, gap-free data formed by an extensive collection of historical, current and forecasted weather content, coupled with proprietary analytics and methodologies.



Read more »

Options for MapReduce with HPC

There has been a strong motivation and desire for co-existence of Hadoop and HPC clusters. While it is true that the HPC world would like to optimize their clusters and leverage MapReduce, it is also true that the Hadoop world would like to invoke MPI apps from their framework. Let’s look at some of the options which may enable this.

Commercial Enterprise Product :


The Advanced Edition of Platform Symphony includes an Apache Hadoop compatible MapReduce implementation optimized for low latency, reliability and resource sharing. Along with IBM Infosphere BigInsights (IBM’s hadoop distribution), Symphony delivers a multi-tenant, heterogeneous application cluster with higher utilization and performance. It ensures efficient sharing to ensure multi-tenancy at both workload and resource layer.

 ( Image source: IBM presentation - link)


Research projects :


myHadoop was developed by Sriram Krishnan and Shava Smallen from San Diego Supercomputer Center (SDSC). It aims to provision Hadoop instances on traditional supercomputing resources on the fly via regular batch scripts. This open source tool has been tested on SDSC Triton, TeraGrid and UC Grid resources.
Hadoop on HPC: Main Challenges
 (Image source – Sriram Krishnan’s presentation –link)


MR job adaptor was developed by Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil. The adaptor aims to:
- Allows transparent MR job submission on HPC clusters
– Minimizes the average turnaround time
– Improve the overall utilization, by exploiting unused resources in the cluster

On the shelf :


  • MR+ (Open MPI) :

MR+ has been one of the most ambitious projects on the subject. The project caught fancy of Greenplum team but it is unclear at this time if it will be available for general use. The project claimed to be 10x faster than YARN and aimed support for multiple HPC environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.). It was also claimed that Mappers and reducers could be written in any of the typical HPC languages (C, C++, and Fortran) as well as Java.

Hadoop On Demand (HOD) was the original open source project on the subject but for some strange reasons was abandoned instead of being upgraded or redesigned. HOD could provision virtual Hadoop clusters over a large physical cluster and used Torque resource manager to do node allocation. On the allocated nodes, it could start Hadoop Map/Reduce and HDFS daemons. 

In the reckoning :


The veritable choice of Twitter, AirBnb and UC Berkley, Mesos is a platform which can run Hadoop, MPI, Hypertable, Spark and other applications. Supercomputing success stories with Mesos though are still awaited from the wider use base.
Mesos can be used to:

  • Run multiple instances of Hadoop on the same cluster to isolate production and experimental jobs, or even multiple versions of Hadoop.
  • Run long-lived services (e.g. Hypertable and HBase) on the same nodes as batch applications and share resources between them.
  • Build new cluster computing frameworks without reinventing low-level facilities for farming out tasks, and have them coexist with existing ones.



While YARN is the future of open source MapReduce, there is still lack of clarity on absolute integration with HPC. While project Hamster was touted as MPI plug-in for YARN, it instead made the MR+ journey to Greenplum as listed above. There has been a very recent contribution of MPI on YARN with MPICH2. We will wait and watch this space.



Read more »

Hadoop High 5 with MapR's John Schroeder



1. Give us a glimpse of your journey with MapR so far.

MapR Technologies has experienced major customer acquisition and corporate expansion since I cofounded the company in 2009. MapR is currently in hyper-growth mode continuing to deliver Big Data platform technology and expand global service and support. MapR launched as a company in 2011 after a two year, well-funded engineering effort, came out of stealth mode with a major strategic OEM agreement and released the industry’s only differentiated Hadoop platform that addressed reliability, availability, ease of use and performance issues required to support critical applications and enable broad adoption. 

At MapR, we’re believers in both the internal and external cloud and in 2012, the MapR Distribution became available on the Google Compute Engine and within the Amazon Elastic MapReduce service on Amazon.com - the two 800 pound gorilla providers of cloud services. Our technology leadership and reputation for pioneering product innovation is now enabling thousands of customers to better manage and analyze Big Data.

Today, the MapR Big Data Platform is being used in production deployments across financial services, government, healthcare, manufacturing, retail and Web 2.0 companies to drive significant business results and includes the analysis of hundreds of billions of objects a day; 90% of the Internet population monthly; and over a trillion dollars of retail transactions annually.

MapR’s latest technology accomplishments include the availability of the MapR M7 Distribution. This is providing ground breaking capabilities for Apache HBase™ NoSQL applications to enhance Big Data operations. We also recently announced the distribution of LucidWorks Search with the MapR Platform. We have also worked as part of the open source community to help incubate and spearhead the development of Apache Drill for low latency, large scale interactive SQL. On a single platform customers can now perform predictive analytics, full search and discovery, and conduct advanced database operations. MapR is pushing the envelope with performance and manageability on Hadoop at a time when Hadoop is crossing over from early adopters to full production mode in the enterprise.

2. Although Hadoop ecosystem has been pretty dynamic, what do you see as major trends for next 3 years?

During the next three years Hadoop will continue to expand capabilities and will be used in a growing number of applications, which will further establish its market leadership in Big Data analytics. Some important trends will include:

One platform for the broadest range of use cases across organizations.


Our more advanced customers are blazing the trail and have clearly communicated they want MapR to continue to push the limits for providing one Big Data platform they can provision for the broadest range of use cases across their organizations. Big Data platforms represent big investments. The return on that investment is highest when scalable decision support, operational, batch, interactive, real time, production, and ad hoc functionality are provided by the platform. Use case support requires multiple industry standard APIs including Hadoop, SQL, NoSQL, file-based, machine learning, real time and search to be seamlessly run against the platform without requiring moving data between platforms. Multi-tenancy requires governance of the compute, network and storage resources to ensure service levels. Multi-tenancy also requires securing data to preclude inadvertent leaks or purposeful attacks.

Security concerns finally addressed.

Security continues to be a barrier to adoption of external and internal cloud architectures like MapR and those provided by MapR partners Amazon and Google. In our personal lives, privacy and security are eroding as millions chronicle their lives in social media. Few people are concerned that their email and smart phone communications are being used to learn more about them, hopefully only for benign targeted marketing initiatives and not more serious intrusions. The corporate world is moving in the opposite direction with decreasing tolerance for information leaks that result in identity theft, release of health records or confidential information and result in regulation violations, lawsuits and, in the case of the federal government, threaten homeland security. During 2013 and going forward, MapR and the industry as a whole, will make significant progress securing Big Data and providing authorization, access control and encryption.

Cloud-based architectures eclipse traditional enterprise architectures.

The traditional enterprise architecture required customers to pay a premium for software and hardware resources with the promise of five 9’s and that basically they wouldn’t fail. An example of this promise is paying between $5,000 and $15,000 per terabyte for enterprise class SAN and NAS storage based on this dependability promise. These devices have spinning disks and heat generating CPUs and will fail over time. Cloud architectures pioneered by Web 2.0 companies, and the basis for MapR, target a different design center. Cloud architectures assume drives, servers, switches and software will have failures and the cloud is architected to transparently absorb these failures without any service disruption. Cloud architectures like MapR use redundancy and instant, stateful failover for all hardware and software resources. Typical cloud service hardware servers provide complete compute and storage for less than $400 per terabyte. Cloud architectures, like MapR, reliably and securely knit together 10s, 100s, and 1000s of these servers resulting in an increasingly abundant storage and compute resource that has created this “Big Data” market disruption.

HBase will become a popular platform for Blob Stores.

HBase provides a non-relational database atop the Hadoop Distributed File System (HDFS). HBase applications have several advantages in certain distributions, including the creation of a unified platform for tables and files, no need for splits or mergers, centralized configuration and logging, and consistent throughput with low latency for database operations. Some distributions also add support for high availability, data protection with mirroring and snapshots, automatic data compression, and rolling upgrades.
One application that is particularly well suited for HBase is Blob Stores, which require large databases with rapid retrieval. Blobs are binary large objects (typically images, audio clips or other multimedia objects), and storing blobs in a database enables a variety of innovative applications. One example is a digital wallet where users upload their credit card images, checks and receipts for online processing, easing banking, purchasing and lost wallet recovery.

NoSQL expands Hadoop from batch analytics to operational use cases.

With its roots in search engines, Hadoop was purpose-built for cost-effective analysis of datasets as enormous as the World Wide Web. The millions of pages of content are analyzed in batches and then served up during searches in real-time. The advances mentioned above and other improvements in Hadoop’s capabilities now make it possible to stream data into a cluster and analyze it in an interactive fashion—both in real time. Use cases like telecommunications billing, and in some cases logistics applications, have outgrown traditional relational database architectures. Telcos thirty years ago tracked a few hundred calls per week to a single home phone. Today they track data and voice transactions to a multitude of devices in one household. Hadoop and HBase provide scale and efficiency advantage for these types of applications over traditional relational data models.

3. Among the various implementations for Hadoop, which use cases are you most excited about?

Partnering with customers to build apps they couldn’t build before while also cutting their costs is exciting. Take cable providers as an example, they can offload their data warehouse processing at a fraction of the cost. We recently implemented a data warehouse offload resulting in over $25M customer savings over the next 2 years. New applications built on the new architecture gives the provider the ability to use recommendation engines for content. 

As a cable customer I’m ecstatic that by using MapR they’ll be able to provide more relevant content on demand. Soon we’ll see relevant content, based on our household viewing history, when searching for on demand content rather than today where all subscribers are presented with non-personalized choices.

4. If there is 1 tip that you would like to give to Hadoop enthusiasts, what would that be?

The growing number of organizations using Hadoop have found it to be an indispensable analytical tool capable of unlocking the value previously hidden deep in data to improve decision-making and gain a competitive edge. Its many advantages have given rise to an entire ecosystem, as well as cloud-based offerings from Google, Amazon, and other service providers. But, organizations have also discovered some serious limitations that can require considerable expertise to integrate, use and manage; and often require considerable effort to protect the data and keep the cluster of servers operational.


I recommend that companies look at all the critical dimensions that ensure Hadoop is deployable in a wide variety of enterprise environments. This means Hadoop must be easy to integrate into the enterprise, as well as more enterprise grade in its operation, performance, scalability and reliability. Specifically, Hadoop platforms should provide enterprise-grade data protection, full high availability, and the ability to integrate into existing environments.

5. Considering you had Time Machine, what would you do with it?

I’d love to go back in time and prevent historical atrocities, but I’ve been educated by too many sci-fi movies that the best intentions can result in disastrous results. Meeting our past religious and political leaders would be insightful and inspiring. That said, I’d probably go back to the final two minute of Super Bowl XLVII and ask Jim Harbaugh to let Frank Gore and Kap use their amazing skills.

Read more »

Caching mechanisms in data visualization software


As the golden boy of data visualization, Tableau Software (NYSE: DATA) made its debut on the stock exchange, it ended day 1 at whopping market capitalization of 2.9 billion $. This emphasized once more hadoopsphere’s prediction that visualization technologies will hold fore in the year 2013.

One of the key features of the visualization software like Tableau’s has been the caching technology. Caching is used to improve overall performance and user experience of Big Data systems. Most of the visualization suites store selection states of queries in memory. When the user makes the same set of ‘selections’, the cache is leveraged to:
-          improve the response time
-          eliminate redundant retrieval
-          reduce storage requirement
-          bring down network traffic

Listed below are current key architectural trends in designing caching mechanisms for visualization software:
  • Dashboard output caching in various formats like HTML, PDF, Excel, and Flash for instant retrieval.
  • Leverage bigger memory space on 64-bit computers.
  • Automatic caching at multiple levels, including element list, metadata object, report dataset, XML definition, document output, and database connection caching.
  • Persist the cache results to disk.
  • Share caches across all cluster nodes.
  • Caching data locally on the mobile devices.
  • Providing admin configuration options for maximum cache size, cache wipe frequency, and options for automatically rebuilding new caches.
  • Securing and wiping the cache on device to prevent data theft.


While the above may have given a good idea of what’s steaming up here, we go down further to get into some real details. We look at two variant proposals of caching mechanism proposed by Tableau and QlikTech. (Please note that these may be used in a divergent manner in current versions of these software.)

(1)   Tableau –

In the published architecture, Tableau uses a data interpreter module which consists of:
- query descriptions for querying databases;
- query cache for storing database query results;
- pane-data-cache for storing separate data structure for each pane in a visual table that is displayed by visual interpreter module.

In one of the implementations for presenting a visual
representation of a query, a determination is made if the query already exists in the query cache. If it exists already, the result is retrieved from the query cache. However, if it does not exist, the target database is queried. "If such a database query is made, data interpreter module will formulate the query in a database-specific manner. For example, in certain instances, data interpreter module will formulate an SQL query whereas in other instances, data interpreter module will formulate an MDX query." Thereafter, the results of the query are added to the query cache.

The data retrieved in the processing steps above can contain data for a set of panes. When this is the case, the data that is fetched above is partitioned into a separate data structure for each pane using a grouping transform that is conceptually the same as a "GROUP BY" in SQL except separate data structures are created for each group rather than performing aggregation. Each output data structure from group-tsf is added to pane-data-cache for later use by visual interpreter module.

In the visual interpreter module, the pane graphic is created using a described specification. Primitive objects like bars in a barchart and their encoding objects for visual properties are created. Thereafter, the per-pane transform to describe tuples display order is applied. The data for pane is retrieved from pane-data-cache using p-lookup. The data (which may be a subset of tuples retrieved from query) is thus bound to a pane.

Source: above text derived from patent US8140586 B2

In the current implementations of Tableau, you may select cache refresh frequency from any of the following options:
·         Refresh Less Often—Data is cached and reused whenever it is available regardless of when it was added to the cache. This option minimizes the number of queries sent to the database. Select this option when data is not changing frequently. Refreshing less often may improve performance.
·         Balanced—Data is removed from the cache after a specified number of minutes. If the data has been added to the cache within the specified time range the cached data will be used, otherwise new data will be queried from the database.
·         Refresh More Often—The database is queried each time the page is loaded. The data is still cached and will be reused until the user reloads the page. This option will ensure users see the most up to date data; however, it may decrease performance.


(2)   QlikTech

In a published architecture in patent US8244741 B2 which utilizes a unique two step caching architecture, QlikTech states in the abstract:

A method for retrieving calculation results, wherein a first input or selection causes a first calculation on a database to produce an intermediate result, and a second selection or input causes a second calculation on the intermediate result, producing a final result. These results are cached with digital fingerprint identifiers.

A first identifier is calculated from the first selection, and a second identifier is calculated from the second selection and the intermediate result. The first identifier and intermediate result are associated and cached, while the second identifier and final result are associated and cached.

The final result may be then retrieved using the first and second selections or inputs by recalculating the first identifier and searching the cache for the first identifier associated with the intermediate result. Upon locating the intermediate result, the second identifier may be recalculated to locate the cached second identifier associated with the final result.


To sum up, we observe that as we move from the traditional analysts to the age of data scientists, the tool makers have alongside been making constant innovations. Visualization is one space which fortunately has been beating the tide. As John Sviokla commented in Harvard Business Review , “So, the good news is that even in a world of information surplus, we can draw upon deep human habits on how to visualize information to make sense of a dynamic reality.”


------------------------------------------
all images taken from Tableau, QlikView website and patents
Read more »