
In-memory computing architecture with Hadoop

With increasing demand for higher performance and real-time access to analytics, adoption of in-memory computing (IMC) software has grown dramatically. IMC is sometimes thought of as a substitute for Apache Hadoop, but numerous real-life implementations have shown the opposite: IMC today complements Apache Hadoop in enterprise architectures to form scalable, low-latency, high-performance systems. We look at the following categories of IMC solution companies and how they integrate with Hadoop:
- Analytical in-memory databases
- In-memory databases for OLTP
- In-memory data grids
- In-memory analytics and visualization
- Complex event processing

In this two-part post, we first look at 15 IMC companies and their Hadoop integration tactics.

Hadoop integration tactics
Aerospike Cross Datacenter Replication (XDR) makes it possible to connect the distributed Aerospike database to data warehouses, including Hadoop-powered ones; the replication software puts a strong emphasis on real-time applications.
Altibase Loader for Hadoop facilitates high-speed data access and transfer between Hadoop and Altibase. Through a simple command-line interface, it can both import and export data between a Hadoop cluster and an Altibase database.
The Couchbase Hadoop Connector was developed in conjunction with Cloudera to give Hadoop users an easy way to move data back and forth between Couchbase and Hadoop.
The plugin connects to Couchbase Server to stream data into HDFS or Hive for processing with Hadoop; akin to Sqoop, it uses a similar command-line argument structure for imports and exports.
Although not a direct integration, the Esper CEP engine provides HDFS input and output adapters that can be used to integrate with HDFS.
Through an external script within any SQL statement, EXASol internally connects to a redirector that starts the Hadoop job using the specified configuration. The procedure is completely transparent to the user of the SQL statement, yet provides a powerful way to extract data from, and export data into, Hadoop systems. Furthermore, the Hadoop Integration Service can smoothly distribute parts of a MapReduce job between Hadoop and EXASolution, making it possible to read raw files from HDFS and process them entirely within EXASolution, or to export a complete set of tables or the result of a query from EXASolution into the Hadoop system.
Fujitsu's proprietary distributed file system, part of the Interstage Big Data Parallel Processing Server, enables Hadoop to access storage-system data directly during processing. A built-in memory-cache feature allocates cache effectively to slave servers, contributing to faster data processing. This lets Apache Hadoop and applications share data, eliminating the overhead of transferring data to and from HDFS.
GigaSpaces Technologies
GigaSpaces lets you combine its In-Memory Data Grid (IMDG) with back-end databases such as HBase, Cassandra and MongoDB. Key features of GigaSpaces XAP 9.0 include real-time streaming data processing, parallel processing, fine-grained data compression and a reduced memory footprint. XAP acts as a mediating layer in front of Cassandra: it fetches frequently used data from the XAP in-memory data grid and falls back to Cassandra when the data cannot be found in XAP.
GridGain's In-Memory Accelerator for Hadoop is based on a dual-mode, high-performance in-memory file system that is 100% compatible with Hadoop HDFS, together with an in-memory-optimized MapReduce implementation. The GridGain in-memory file system (GGFS) can work either as a standalone primary file system in the Hadoop cluster or in tandem with HDFS, serving as an intelligent caching layer with HDFS configured as the primary file system. GridGain's in-memory MapReduce efficiently parallelizes the processing of data stored in GGFS, eliminating the job-tracker and task-tracker overhead of a standard Hadoop architecture while providing low-latency, HPC-style distributed processing.
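The in-memory MapReduce idea can be illustrated with a minimal single-process sketch (plain Python, not GridGain's API): partitions already resident in memory are mapped in parallel and reduced directly, with no job tracker or task trackers in the path.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(partition):
    # Emit (word, 1) pairs for every record in this in-memory partition.
    return [(word, 1) for line in partition for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts for each key.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def in_memory_mapreduce(partitions):
    # Map each partition in parallel, then reduce the combined output.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, partitions))
    all_pairs = [pair for chunk in mapped for pair in chunk]
    return reduce_phase(all_pairs)

partitions = [["big data", "in memory data"], ["memory grid"]]
print(in_memory_mapreduce(partitions))
```

Because the data never leaves memory, the only coordination cost is the thread pool itself, which is the essence of the low-latency claim above.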
Hazelcast can be used as a distributed cache with HBase as the persistence layer, via custom code libraries. On retrieval, if a key is not found in memory, Hazelcast looks it up in HBase; similarly, when a key-value pair is inserted, Hazelcast persists it to HBase.
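The lookup and persistence behaviour described above is the classic read-through / write-through cache pattern. A minimal generic sketch (plain Python with a dict standing in for HBase; this is not Hazelcast's actual API):

```python
class ReadWriteThroughCache:
    """Generic read-through / write-through cache in front of a slower store."""

    def __init__(self, backing_store):
        self.memory = {}                    # in-memory layer (stands in for Hazelcast)
        self.backing_store = backing_store  # persistent layer (stands in for HBase)

    def get(self, key):
        # Read-through: serve from memory, fall back to the backing store on a miss.
        if key not in self.memory:
            value = self.backing_store.get(key)
            if value is not None:
                self.memory[key] = value
        return self.memory.get(key)

    def put(self, key, value):
        # Write-through: update memory and persist in the same operation.
        self.memory[key] = value
        self.backing_store[key] = value

store = {"user:1": "alice"}        # pretend this dict is HBase
cache = ReadWriteThroughCache(store)
print(cache.get("user:1"))         # miss in memory, loaded from the backing store
cache.put("user:2", "bob")         # written to memory and persisted together
```

Write-through keeps the cache and the persistent store consistent at all times, at the cost of paying the store's write latency on every insert.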
IBM DB2 with BLU Acceleration uses dynamic in-memory columnar technologies. DB2's integration with the InfoSphere BigInsights Hadoop distribution has two main components: the IBM InfoSphere BigInsights Jaql server and DB2 user-defined functions (UDFs). The Jaql server is a middleware component that accepts Jaql query-processing requests from multiple DB2 for Linux, UNIX, and Windows clients.
Kognitio provides an HDFS connector for defining access to the HDFS file system; an external table accesses row-based data in HDFS, which can be read dynamically or "pinned" into memory. The Filter Agent Connector uploads an agent to the Hadoop nodes; the query passes selections and relevant predicates to the agent, so data filtering and projection take place locally on each Hadoop node and only data of interest is loaded into memory via parallel load streams.
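The filter-agent idea, pushing the predicate and projection down to each node so that only matching rows cross the network, can be sketched as follows (plain Python; the node lists and helper names are illustrative, not Kognitio's API):

```python
def filter_agent(partition, predicate, columns):
    # Runs "on" each Hadoop node: apply the predicate and project the
    # requested columns locally, so only data of interest is shipped.
    return [{c: row[c] for c in columns} for row in partition if predicate(row)]

def parallel_load(partitions, predicate, columns):
    # Simulate parallel load streams by concatenating each node's filtered output.
    loaded = []
    for partition in partitions:
        loaded.extend(filter_agent(partition, predicate, columns))
    return loaded

nodes = [
    [{"id": 1, "region": "EU", "amount": 10}, {"id": 2, "region": "US", "amount": 99}],
    [{"id": 3, "region": "EU", "amount": 42}],
]
rows = parallel_load(nodes, lambda r: r["region"] == "EU", ["id", "amount"])
print(rows)  # only EU rows, and only the projected columns
```

The win is that the expensive part (scanning every row) happens where the data lives; the in-memory platform only ever sees the filtered, projected result.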
The Microsoft SQL Server SQOOP Connector for Hadoop is now part of Apache Sqoop 1.4. Through its in-memory OLTP technology, Microsoft SQL Server uses a new data structure to minimize locking and latching and lets IT select "hot" tables to keep in memory. With the Hive ODBC Driver connecting SQL Server to Hadoop, customers can use Microsoft BI tools such as PowerPivot and Power View in SQL Server 2012 to analyze all types of data, including unstructured data.
The ODBC Connector for MicroStrategy enables users to access Hadoop data through the MicroStrategy Business Intelligence (BI) application. The driver translates Open Database Connectivity (ODBC) calls from MicroStrategy into SQL and passes the queries to the underlying Impala or Hive engines. Further, with MicroStrategy 9, Amazon Elastic MapReduce customers can query data through a point-and-click interface without writing HiveQL scripts or MapReduce jobs.
Oracle TimesTen In-Memory Database (TimesTen) is a full-featured, memory-optimized relational database with persistence and recoverability. Queries can be run in Hive using the federated Hadoop function in Oracle Database or Exalytics; the aggregates are then passed to the TimesTen database and refreshed when new data is loaded into HDFS. This federated architecture delivers near-instantaneous results for analytical queries served from TimesTen cache tables.
Pivotal aims to leverage distributed memory across a large farm of commodity servers to offer very low-latency SQL queries and transactional updates. The memory tier integrates seamlessly with Pivotal HD's Hadoop file system: HDFS can be configured as the destination for all writes captured in the distributed in-memory tier (distributed, high-speed ingest) or as the underlying read-write storage tier for the data cached in memory. HDFS provides reliable storage and efficient parallel recovery during restarts, and historical data that cannot fit in main memory can automatically be faulted in from HDFS.
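The "distributed, high-speed ingest" path above is essentially a write-behind pattern: writes land in memory first and are flushed to durable storage in batches. A minimal sketch (plain Python with a dict standing in for HDFS; the class and parameter names are illustrative, not Pivotal's API):

```python
from collections import deque

class WriteBehindTier:
    """In-memory tier that queues writes and flushes them to durable storage in batches."""

    def __init__(self, durable_store, batch_size=2):
        self.memory = {}                    # serves reads at memory speed
        self.pending = deque()              # writes not yet persisted
        self.durable_store = durable_store  # stands in for HDFS
        self.batch_size = batch_size

    def put(self, key, value):
        # Capture the write in memory and queue it for persistence.
        self.memory[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Persist all queued writes in one batch (the high-speed ingest step).
        while self.pending:
            key, value = self.pending.popleft()
            self.durable_store[key] = value

hdfs = {}                        # pretend this dict is HDFS
tier = WriteBehindTier(hdfs, batch_size=2)
tier.put("k1", "v1")             # buffered in memory only
tier.put("k2", "v2")             # batch is full, so both writes reach "HDFS"
```

Unlike write-through, write-behind decouples the write latency seen by the application from the durable store's latency, which is why a slower tier like HDFS can back a fast ingest path.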

In the second part of this post, we will look at another set of 15 IMC companies. Read on.

Next: Leveraging Hadoop with in-memory computing >>


  1. Great post!

    GigaSpaces offers integration with Storm, which makes large-scale workflow-processing applications highly scalable.

    GigaSpaces also offers SSD integration, which allows the data grid to leverage any block storage device and manage many terabytes of data across RAM and SSD in a grid configuration.


  2. Hello - Great post.

    Kognitio offers a tight integration with Hadoop: two different connection methods operate in unison to deliver massive performance. Thousands of parallel connection threads run between Kognitio's in-memory analytical platform and HDFS, offering load speeds on the order of 5 GB/second. This avoids Hive batch processing and enables interactive SQL on top of Hadoop.



