In this post, let us look at three real-life indexing use cases. While Hadoop is commonly used for distributed batch index building, it is often desirable to keep the index fresh in near real time. We look at some practical implementations where engineers have successfully worked out technology stacks that combine different products.
(1) Near Real Time index at eBay:
The first use case looks at eBay, where HBase is used in a novel approach to building a Near Real Time search index:
- Building a full index takes hours due to the data-set size
- The number of items changed every minute is much smaller
- Identify updates in a time window t1 – t2 (time-range scan)
- Build a ‘mini index’ on only the last X minutes of changes using MapReduce
- Mini indices are copied and consumed in near real time by the query servers
- An HBase column family tracks the last-modified time
- Utilize the ‘time range scan’ feature of HBase (a sketch follows this list)
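Here is a minimal sketch of such a time-range scan using the HBase Java client. The table name ("items"), the column family ("m") and the 10-minute window are assumptions for illustration; eBay's actual schema is not described at that level of detail.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MiniIndexScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("items"))) { // hypothetical table name
            long t2 = System.currentTimeMillis();
            long t1 = t2 - 10 * 60 * 1000L; // last X = 10 minutes of changes (assumed window)
            Scan scan = new Scan()
                    .addFamily(Bytes.toBytes("m")) // hypothetical CF tracking last-modified time
                    .setTimeRange(t1, t2);         // HBase time-range scan: only cells written in [t1, t2)
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // Each row returned here changed within the window; these rows
                    // would feed the MapReduce job that builds the 'mini index'.
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}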
(2) Distributed indexing strategy at Trovit:
Trovit is a search engine for classified ads covering real estate, jobs, cars and vacation rentals. It seems to have arrived at the right mix of Storm, HDFS, HBase and ZooKeeper for its architecture. With regard to the distributed indexing strategy in particular, it employs:
- Two-phase indexing (2 sequential MapReduce jobs) comprising:
-- Partial indexing: generate many “micro indexes” for each monolithic or sharded index (MapReduce + Embedded Solr + HDFS)
-- Merge: group all the “micro indexes” and merge them into the production index (Lucene → HDFS); a sketch of this step follows the list
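The merge step maps naturally onto Lucene's IndexWriter.addIndexes. Here is a minimal sketch; the local paths stand in for micro indexes pulled off HDFS, and the shard layout is an assumption for illustration, not Trovit's actual directory structure.

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MicroIndexMerge {
    public static void main(String[] args) throws IOException {
        // Destination: the production index for one shard (hypothetical path).
        try (Directory prod = FSDirectory.open(Paths.get("/indexes/shard0/prod"));
             IndexWriter writer = new IndexWriter(prod,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            // Sources: the 'micro indexes' produced by the partial-indexing job.
            Directory[] micro = new Directory[] {
                    FSDirectory.open(Paths.get("/indexes/shard0/micro-0001")),
                    FSDirectory.open(Paths.get("/indexes/shard0/micro-0002"))
            };
            writer.addIndexes(micro); // merge all micro indexes into the production index
            writer.forceMerge(1);     // optional: collapse to a single segment for serving
            for (Directory d : micro) {
                d.close();
            }
        }
    }
}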
(3) Incremental Processing by Google’s Percolator:
The topic would be incomplete without a reference to Google’s Percolator paper, which describes a technique for incrementally updating an index stored in Bigtable. In the authors’ words:
“A Percolator system consists of three binaries that run on every machine in the cluster: a Percolator worker, a Bigtable [9] tablet server, and a GFS [20] chunkserver. All observers are linked into the Percolator worker, which scans the Bigtable for changed columns (‘notifications’) and invokes the corresponding observers as a function call in the worker process. The observers perform transactions by sending read/write RPCs to Bigtable tablet servers, which in turn send read/write RPCs to GFS chunkservers.”
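To make the observer model concrete, here is a purely illustrative Java sketch of the scan-and-dispatch loop the quote describes. Every type here (Observer, BigtableClient, PercolatorWorker) is a hypothetical stand-in, not a real API; Percolator itself is internal to Google.

import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Bigtable RPC stubs; not a real client library.
interface Observer {
    // Invoked by the worker when the column it watches changes.
    void onNotify(String rowKey, BigtableClient bigtable);
}

interface BigtableClient {
    // Scan one column across the table; used here on the 'notify' column
    // that Percolator sets alongside each data write.
    List<Map.Entry<String, byte[]>> scanColumn(String columnName);
    void delete(String rowKey, String columnName);
}

class PercolatorWorker {
    private final Map<String, Observer> observersByColumn; // observed column -> observer
    private final BigtableClient bigtable;

    PercolatorWorker(Map<String, Observer> observersByColumn, BigtableClient bigtable) {
        this.observersByColumn = observersByColumn;
        this.bigtable = bigtable;
    }

    // The worker loops forever, scanning for notification cells and running the
    // matching observer as an in-process function call, as the paper describes.
    void run() {
        while (true) {
            for (Map.Entry<String, Observer> e : observersByColumn.entrySet()) {
                String notifyColumn = e.getKey() + ":notify"; // hypothetical naming convention
                for (Map.Entry<String, byte[]> cell : bigtable.scanColumn(notifyColumn)) {
                    // The observer performs its own transactional reads/writes
                    // via RPCs to Bigtable tablet servers.
                    e.getValue().onNotify(cell.getKey(), bigtable);
                    // Clear the notification so the change is processed only once.
                    bigtable.delete(cell.getKey(), notifyColumn);
                }
            }
        }
    }
}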
Taking a cue from this implementation, engineers have worked out many variations that leverage Hadoop HDFS in combination with HBase, Storm and/or Hive.
--------------------------------------------------
top image source: freedigitalphotos.net