
In-memory computing architecture with Hadoop

With demand growing for higher performance and real-time access to analytics, adoption of in-memory computing (IMC) software has increased sharply. IMC is sometimes thought of as a substitute for Apache Hadoop, but numerous real-life implementations have shown the opposite: IMC today complements Apache Hadoop in enterprise architectures to form scalable, low-latency, high-performance systems. We look at the following categories of IMC solutions and list how the companies behind them integrate with Hadoop:
- Analytical In-Memory Database
- In-Memory Database for OLTP
- In-Memory Data Grids
- In-Memory Analytics and Visualization
- Complex Event Processing

In this two-part post, we first list 15 IMC companies and their Hadoop connection tactics.

Hadoop integration tactic
With Aerospike Cross Data Center Replication (XDR), it is possible to connect the distributed Aerospike database to data warehouses, including Hadoop-powered ones, using the replication software which puts a big emphasis on real-time apps.
Altibase Loader for Hadoop facilitates high-speed data access and transfer between Hadoop and Altibase. Through a simple command line interface, it imports data from a Hadoop cluster into the Altibase database and exports data from the Altibase database to a Hadoop cluster.
The Couchbase Hadoop Connector has been developed in conjunction with Cloudera to allow Hadoop users an easy method of moving data back and forth between Couchbase and Hadoop.
This plugin connects to Couchbase Server to stream data into HDFS or Hive for processing with Hadoop. Akin to Sqoop for imports and exports from other databases, it uses a similar command line argument structure.
Although not a direct use case, Esper CEP has an HDFS input and output adapter that can be used for integrating with HDFS.
With an external script within any SQL statement, EXAsol internally connects to the redirector, which starts the Hadoop job using the specified configuration. This procedure is completely transparent to the user of the SQL statement, yet makes it possible to extract data from and export data into Hadoop systems. Furthermore, the Hadoop Integration Service lets you smoothly distribute parts of the MapReduce job between Hadoop and EXASolution. This makes it possible to just read the raw files from HDFS and process them completely within EXASolution. It also allows exporting a complete set of tables, or the result of a query, from EXASolution into the Hadoop system.
Fujitsu's proprietary Distributed File System, which is part of Interstage Big Data Parallel Processing Server, enables direct access to storage system data during data processing by Hadoop. The built-in memory cache feature allows memory cache to be allocated effectively to slave servers, contributing to faster data processing. This lets Apache Hadoop and applications share data, eliminating the overhead of data transfer to and from HDFS.
GigaSpaces Technologies
GigaSpaces allows you to combine its In-Memory Data Grid (IMDG) with back-end databases like HBase, Cassandra and MongoDB. Key features of GigaSpaces XAP 9.0 include real-time streaming data processing, parallel processing, fine-grained data compression and a reduced memory footprint. With Cassandra, for example, XAP acts as a mediating layer between the application and the database: frequently used data is fetched from the XAP in-memory data grid, with a fallback to Cassandra if the data cannot be found in XAP.
GridGain’s In-Memory Accelerator for Hadoop is based on a dual-mode, high-performance in-memory file system that is 100% compatible with Hadoop HDFS, together with an in-memory optimized MapReduce implementation. GridGain’s in-memory file system (GGFS) supports a dual mode that allows it to work either as a standalone primary file system in the Hadoop cluster or in tandem with HDFS, serving as an intelligent caching layer with HDFS configured as the primary file system. GridGain’s in-memory MapReduce effectively parallelizes the processing of in-memory data stored in GGFS. It eliminates the overhead associated with the job tracker and task trackers in a standard Hadoop architecture while providing low-latency, HPC-style distributed processing.
Hazelcast can be used as a distributed cache with HBase as the persistence layer, using custom code libraries. On data retrieval, if a key is not found in memory, Hazelcast looks it up in HBase. Similarly, when inserting a key-value pair, Hazelcast persists it to HBase.
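The read and write paths described for Hazelcast and HBase follow the general read-through / write-through caching pattern. The following is a minimal sketch of that pattern only, not Hazelcast's or HBase's API: the class names BackingStore and ReadWriteThroughCache are hypothetical, and a plain dictionary stands in for HBase.

```python
from typing import Dict, Optional


class BackingStore:
    """Hypothetical stand-in for the persistence layer (HBase in the text)."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}

    def load(self, key: str) -> Optional[str]:
        return self.rows.get(key)

    def store(self, key: str, value: str) -> None:
        self.rows[key] = value


class ReadWriteThroughCache:
    """Hypothetical in-memory map that delegates misses and writes to a store."""

    def __init__(self, store: BackingStore) -> None:
        self.store = store
        self.memory: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Read-through: serve from memory, otherwise look in the backing store.
        if key in self.memory:
            return self.memory[key]
        value = self.store.load(key)
        if value is not None:
            self.memory[key] = value  # keep the loaded value in memory
        return value

    def put(self, key: str, value: str) -> None:
        # Write-through: persist first, then update the in-memory copy.
        self.store.store(key, value)
        self.memory[key] = value
```

In this sketch a failed write to the store leaves memory untouched, which is one possible ordering; a real deployment would also have to decide on eviction and on how stale entries are refreshed.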
IBM DB2 with BLU Acceleration uses dynamic in-memory columnar technologies. DB2 integration with InfoSphere BigInsights Hadoop distribution has two main components: the IBM InfoSphere BigInsights Jaql server and DB2 user-defined functions (UDFs). The Jaql server is a middleware component that can accept Jaql query processing requests from multiple DB2 for Linux, UNIX, and Windows clients.
Kognitio provides an HDFS connector that defines access to the HDFS file system: external tables access row-based data in HDFS, either dynamically or by “pinning” the data into memory. The Filter Agent Connector uploads an agent to the Hadoop nodes. A query passes its selections and relevant predicates to the agent; data filtering and projection then take place locally on each Hadoop node, and only the data of interest is loaded into memory via parallel load streams.
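The filter-agent idea (apply the selection and projection where the data lives, then load only what survives) can be sketched as follows. This is an illustration under stated assumptions, not Kognitio's connector: filter_agent and parallel_load are hypothetical names, each "node" is just a list of row dictionaries, and a thread pool stands in for the parallel load streams.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def filter_agent(local_rows: Sequence[Row],
                 predicate: Callable[[Row], bool],
                 columns: Sequence[str]) -> List[Row]:
    """Runs 'on the node': keep matching rows and only the requested columns."""
    return [{c: row[c] for c in columns} for row in local_rows if predicate(row)]


def parallel_load(nodes: Sequence[Sequence[Row]],
                  predicate: Callable[[Row], bool],
                  columns: Sequence[str]) -> List[Row]:
    """Collect the already-filtered rows from every node, one stream per node."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        parts = pool.map(lambda rows: filter_agent(rows, predicate, columns), nodes)
        # pool.map yields results in node order, so the output is deterministic.
        return [row for part in parts for row in part]
```

A call such as parallel_load(nodes, lambda r: r["region"] == "EU", ["id", "amount"]) would bring only the id and amount columns of EU rows into memory; the rest never leaves the node.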
The Microsoft SQL Server SQOOP Connector for Hadoop is now part of Apache SQOOP 1.4. Through its in-memory OLTP technologies, Microsoft SQL Server uses a new data structure to minimize locking and latching, and lets IT select “hot” tables to keep in memory. With the Hive ODBC Driver that connects SQL Server to Hadoop, customers can use Microsoft BI tools like PowerPivot and Power View in SQL Server 2012 to analyze all types of data, including unstructured data.
The ODBC Connector for MicroStrategy enables users to access Hadoop data through the Business Intelligence (BI) application MicroStrategy. The driver achieves this by translating Open Database Connectivity (ODBC) calls from MicroStrategy into SQL and passing the SQL queries to the underlying Impala or Hive engines. Further, with MicroStrategy 9, Amazon Elastic MapReduce customers can use a point-and-click interface to query data without developing HiveQL scripts or MapReduce jobs.
Oracle TimesTen In-Memory Database (TimesTen) is a full-featured, memory-optimized, relational database with persistence and recoverability. Queries can be run in Hive using the federated Hadoop function in Oracle DB or Exalytics. The aggregates are then passed on to the TimesTen database and refreshed when new data is loaded into HDFS. This federated architecture gives instantaneous results on analytical queries served from TimesTen cache tables.
Pivotal aims to leverage distributed memory across a large farm of commodity servers to offer very low latency SQL queries and transactional updates. The memory tier seamlessly integrates with PivotalHD’s Hadoop file system. HDFS can be configured as the destination for all writes captured in the distributed in-memory tier (distributed, high-speed ingest) or as the underlying read-write storage tier for the data cached in memory. HDFS provides reliable storage and efficient parallel recovery during restarts. Historical data that cannot fit in main memory can automatically be faulted in from HDFS.
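The flow described for TimesTen (aggregate over the raw data, keep the aggregates in an in-memory cache table, refresh when new data arrives) can be sketched as below. This is a toy model under loud assumptions: AggregateCache, load and total_for are hypothetical names, a list stands in for HDFS, a dictionary stands in for the cache table, and the Hive query is replaced by a simple sum per region.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class AggregateCache:
    """Hypothetical sketch: raw rows on 'disk', pre-computed totals in memory."""

    def __init__(self) -> None:
        self.raw_rows: List[Tuple[str, float]] = []  # stand-in for data in HDFS
        self.cache_table: Dict[str, float] = {}      # stand-in for a cache table

    def _aggregate(self) -> Dict[str, float]:
        # Stand-in for the batch query that computes the aggregates.
        totals: Dict[str, float] = defaultdict(float)
        for region, amount in self.raw_rows:
            totals[region] += amount
        return dict(totals)

    def load(self, rows: Iterable[Tuple[str, float]]) -> None:
        # New data lands in the raw store, then the cached aggregates are refreshed.
        self.raw_rows.extend(rows)
        self.cache_table = self._aggregate()

    def total_for(self, region: str) -> float:
        # Analytical queries read only the in-memory aggregates.
        return self.cache_table.get(region, 0.0)
```

The sketch recomputes every aggregate on each load for simplicity; whether a refresh is full or incremental is a design choice the source does not specify.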
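A memory tier that writes everything through to durable storage and faults older data back in on demand can be sketched like this. It illustrates the idea only and is not Pivotal's implementation: MemoryTier is a hypothetical name, a dictionary stands in for HDFS, and least-recently-used eviction is an assumption, since the source does not say how data is chosen to leave memory.

```python
from collections import OrderedDict
from typing import Dict


class MemoryTier:
    """Hypothetical bounded in-memory tier backed by a durable store."""

    def __init__(self, capacity: int, durable: Dict[str, int]) -> None:
        self.capacity = capacity
        self.durable = durable                       # stand-in for HDFS
        self.hot: "OrderedDict[str, int]" = OrderedDict()

    def _admit(self, key: str, value: int) -> None:
        self.hot[key] = value
        self.hot.move_to_end(key)                    # mark as most recently used
        while len(self.hot) > self.capacity:
            self.hot.popitem(last=False)             # evict least recently used

    def put(self, key: str, value: int) -> None:
        self.durable[key] = value                    # every write reaches durable storage
        self._admit(key, value)

    def get(self, key: str) -> int:
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.durable[key]                    # fault in; KeyError if unknown
        self._admit(key, value)
        return value
```

Because every write is persisted before it is cached, evicting an entry never loses data; it only means the next read of that key pays the cost of going to the durable tier.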

In the second part of the post, we will look at another 15 companies offering IMC. Read on.

Leveraging Hadoop with in-memory computing>>


  1. Great post!

    GigaSpaces offers integration with Storm. It makes large-scale workflow processing applications highly scalable. See more:

    GigaSpaces offers SSD integration. It allows the data grid to leverage any block storage device, managing many terabytes of data across RAM and SSD in a grid configuration.


  2. Hello - Great post.

    Kognitio offers a tight integration with Hadoop. There are two different connection methods operating in unison to offer massive performance. Thousands of parallel connection threads are run between Kognitio's in-memory analytical platform and HDFS, offering load speeds on the order of 5 GB/second. This avoids HIVE batch processing and enables interactive SQL on top of Hadoop. More on this at !!



