There has been sustained interest in Cloudera Impala, which enables
real-time, interactive analytical queries on data stored in HBase or HDFS.
Below is a review sheet for Cloudera Impala compiled from multiple
sources. Feel free to note anything that you know to be different.
What’s there:
- USP:
   o Low-latency querying for HDFS
- Business use case:
   o Can be used for near-real-time operations
- Key components:
   o Daemon: impalad – low-latency daemon running on each datanode (mutually exclusive of MapReduce)
   o StateStore: Impala StateStore – high-throughput scheduler which stores the state of the daemons running on the nodes; also provides a subscription service, Thrift mode and failure detection (also for HA)
   o Shell: Impala Shell – standard querying interface
- Production Ready Version
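To make the StateStore's role more concrete, here is a toy model of a registry that tracks daemon heartbeats, notifies subscribers, and flags daemons that have gone quiet. This is only a conceptual sketch: the class `ToyStateStore`, its methods, and the timeout value are all hypothetical and are not taken from Impala's code or API.

```python
class ToyStateStore:
    """Hypothetical, simplified stand-in for a state store (not Impala's implementation)."""

    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_seen = {}    # daemon name -> timestamp of last heartbeat
        self.subscribers = []  # callbacks invoked as callback(daemon, alive)

    def subscribe(self, callback):
        # Subscription service: interested parties register a callback.
        self.subscribers.append(callback)

    def heartbeat(self, daemon, now):
        # Record the heartbeat; announce the daemon the first time it is seen.
        is_new = daemon not in self.last_seen
        self.last_seen[daemon] = now
        if is_new:
            self._notify(daemon, True)

    def detect_failures(self, now):
        # Failure detection: any daemon silent for longer than the timeout is dropped.
        failed = [d for d, t in self.last_seen.items() if now - t > self.timeout_s]
        for daemon in failed:
            del self.last_seen[daemon]
            self._notify(daemon, False)
        return failed

    def _notify(self, daemon, alive):
        for callback in self.subscribers:
            callback(daemon, alive)


events = []
store = ToyStateStore(timeout_s=10.0)
store.subscribe(lambda daemon, alive: events.append((daemon, alive)))
store.heartbeat("impalad-1", now=0.0)
store.heartbeat("impalad-2", now=0.0)
store.heartbeat("impalad-1", now=8.0)   # impalad-1 keeps reporting in
failed = store.detect_failures(now=12.0)
print(failed)  # impalad-2 has been silent for 12 s, beyond the 10 s timeout
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to follow.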
What’s not there (in current release):
- Resource Manager
- User Defined Function support for Hive
- Delay Scheduling
- Manual query aborts
- DDL Statements
- Procedures, Scheduled jobs
What to expect (in current release):
- SQL-92 features of the Hive Query Language
- Low-latency start and fetch
- ODBC support, Command Line Interface for querying
- Kerberos authentication
- Possibly lower cost of ownership than licensed counterparts like Hadapt
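As a rough illustration of the command-line route, a script could assemble a one-off impala-shell invocation like the one below. The sketch only builds the argument list and does not run anything. The host name, table name and helper function are hypothetical, port 21000 is assumed to be the shell's default, and the `-i` (impalad to connect to) and `-q` (query string) options are assumed to be available in the installed impala-shell version.

```python
import shlex


def build_impala_shell_command(impalad_host, query, port=21000):
    """Return the argv list for a single impala-shell query (hypothetical helper)."""
    # -i names the impalad to connect to, -q passes one query string.
    return ["impala-shell", "-i", f"{impalad_host}:{port}", "-q", query]


cmd = build_impala_shell_command(
    "datanode1.example.com",                              # hypothetical host
    "SELECT COUNT(*) FROM web_logs WHERE status = 404",   # hypothetical table
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd)` would execute the query on a machine where impala-shell is installed.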
What not to expect (in current release):
- RDBMS speed in all operations
- Trevni support, which will bring in support for columnar binary storage and more compression options
- A lot of consistent documentation
When to expect new features:
- Q1’2013 for a more stable version
- More feature-loaded version in CDH5
Which is the closest product match:
- RAD Labs Sparrow BatchSampling
How to update yourself
Where to download:
- Documentation
- Code
selective snapshot below:




