When Tableau Software (NYSE: DATA), the golden boy of data visualization, made its debut on the stock exchange, it ended day 1 at a whopping market
capitalization of $2.9 billion. This underscored once more hadoopsphere’s prediction that visualization technologies would come to the fore in 2013.
A key feature of visualization software such as Tableau’s
has been its caching technology. Caching improves the overall performance
and user experience of Big Data systems. Most visualization suites store
the selection states of queries in memory. When the user makes the same set of ‘selections’ again,
the cache is leveraged to:
- improve the response time
- eliminate redundant retrieval
- reduce storage requirement
- bring down network traffic
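To make the idea of caching selection states concrete, here is a minimal Python sketch. It is not taken from any vendor's product: the `SelectionCache` class, the `run_query` callback, and the hit/miss counters are hypothetical names used only for illustration. It assumes a selection can be represented as a set of field/value pairs.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# A selection state, e.g. {("region", "EMEA"), ("year", "2013")} (assumed shape).
Selection = FrozenSet[Tuple[str, str]]


class SelectionCache:
    """Keeps query results in memory, keyed by the user's set of selections."""

    def __init__(self, run_query: Callable[[Selection], List[dict]]) -> None:
        self._run_query = run_query  # hypothetical callback that hits the database
        self._results: Dict[Selection, List[dict]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, selections: Dict[str, str]) -> List[dict]:
        # A frozenset makes the key independent of the order of the selections.
        key = frozenset(selections.items())
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        rows = self._run_query(key)  # only reached on a cache miss
        self._results[key] = rows
        return rows
```

Repeating the same selections, in any order, is then served from memory without a second trip to the database.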
Listed below are current key architectural trends in
designing caching mechanisms for visualization software:
- Caching dashboard output in various formats like HTML, PDF, Excel, and Flash for instant retrieval.
- Leveraging the bigger memory space on 64-bit computers.
- Caching automatically at multiple levels, including element list, metadata object, report dataset, XML definition, document output, and database connection caching.
- Persisting the cache results to disk.
- Sharing caches across all cluster nodes.
- Caching data locally on mobile devices.
- Providing admin configuration options for maximum cache size, cache wipe frequency, and automatic rebuilding of new caches.
- Securing and wiping the cache on the device to prevent data theft.
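Two of the admin options listed above, a maximum cache size and a cache wipe, can be sketched as follows. This is only a sketch under assumptions: the `BoundedCache` class and its least-recently-used eviction rule are illustrative choices, not a description of how any particular product enforces its limits.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class BoundedCache:
    """In-memory cache with a configurable maximum number of entries."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # a read also counts as a use
        return self._entries[key]

    def wipe(self) -> None:
        """Drop everything, e.g. on a schedule or when a device is locked."""
        self._entries.clear()

    def __len__(self) -> int:
        return len(self._entries)
```

A scheduler calling `wipe()` at a fixed interval would then stand in for the "cache wipe frequency" setting.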
While the above may have given a good idea of what’s
brewing here, we now go further into some real details. We look at
two different caching mechanisms proposed by Tableau and QlikTech. (Please
note that these may be implemented differently in current versions of these
products.)
(1) Tableau
In the published architecture,
Tableau uses a data interpreter module which consists of:
- query descriptions for querying databases;
- query cache for storing database query results;
- pane-data-cache for storing a separate data structure for each pane in a visual table that is displayed by the visual interpreter module.
In one of the implementations for
presenting a visual
The data retrieved in the processing steps above can contain data for a set of panes. When this is the case, the fetched data is partitioned into a separate data structure for each pane using a grouping transform (group-tsf) that is conceptually the same as a "GROUP BY" in SQL, except that a separate data structure is created for each group rather than an aggregate being computed. Each output data structure from group-tsf is added to pane-data-cache for later use by the visual interpreter module.
In the visual interpreter module, the pane graphic is created from a described specification. Primitive objects, like the bars in a bar chart, and the encoding objects for their visual properties are created. Thereafter, the per-pane transform that describes the display order of tuples is applied. The data for a pane is retrieved from pane-data-cache using p-lookup. The data (which may be a subset of the tuples retrieved by the query) is thus bound to a pane.
Source: the text above is derived from patent US8140586 B2
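The grouping transform and the pane lookup described above can be illustrated with a short Python sketch. The function names `group_tsf` and `p_lookup` borrow the patent's terms, but the signatures, the row format (a dict per tuple), and the use of a plain dict as pane-data-cache are assumptions made for illustration, not the patented implementation.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

Row = Dict[str, object]          # one tuple from the query result (assumed shape)
PaneKey = Tuple[Hashable, ...]   # values of the fields that identify a pane


def group_tsf(rows: Iterable[Row], pane_fields: Sequence[str]) -> Dict[PaneKey, List[Row]]:
    """Partition rows per pane, like GROUP BY but without aggregating.

    Every group keeps its full list of rows instead of a single summary row.
    """
    panes: Dict[PaneKey, List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in pane_fields)
        panes[key].append(row)
    return dict(panes)


def p_lookup(pane_data_cache: Dict[PaneKey, List[Row]], pane_key: PaneKey) -> List[Row]:
    """Fetch the rows bound to one pane; an unknown pane yields an empty list."""
    return pane_data_cache.get(pane_key, [])
```

For example, grouping sales rows on `("region", "year")` would give one list of rows per region/year pane, and `p_lookup(cache, ("EMEA", 2013))` would return just the rows to draw in that pane.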
In the current implementations of Tableau, you may select the cache refresh
frequency from any of the following options:
- Refresh Less Often: Data is cached and reused whenever it is available, regardless of when it was added to the cache. This option minimizes the number of queries sent to the database. Select this option when data is not changing frequently. Refreshing less often may improve performance.
- Balanced: Data is removed from the cache after a specified number of minutes. If the data has been added to the cache within the specified time range, the cached data will be used; otherwise new data will be queried from the database.
- Refresh More Often: The database is queried each time the page is loaded. The data is still cached and will be reused until the user reloads the page. This option will ensure users see the most up-to-date data; however, it may decrease performance.
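The three refresh options can be read as three freshness rules over the same cache. The Python sketch below models them that way; it is an illustration only, and `RefreshPolicy`, `RefreshingCache`, the `page_reload` flag, and the injectable `clock` are hypothetical names, not Tableau settings or APIs.

```python
import time
from enum import Enum
from typing import Any, Callable, Dict, Hashable, Tuple


class RefreshPolicy(Enum):
    LESS_OFTEN = "less_often"   # reuse cached data regardless of age
    BALANCED = "balanced"       # reuse cached data up to a maximum age
    MORE_OFTEN = "more_often"   # re-query on every page load


class RefreshingCache:
    """Query-result cache whose freshness rule depends on the chosen policy."""

    def __init__(self, policy: RefreshPolicy, max_age_seconds: float = 0.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._policy = policy
        self._max_age = max_age_seconds  # only used by BALANCED
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable, load: Callable[[], Any],
            page_reload: bool = False) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and self._is_fresh(entry[0], now, page_reload):
            return entry[1]
        value = load()  # cache miss or stale entry: query the database again
        self._entries[key] = (now, value)
        return value

    def _is_fresh(self, stored_at: float, now: float, page_reload: bool) -> bool:
        if self._policy is RefreshPolicy.LESS_OFTEN:
            return True
        if self._policy is RefreshPolicy.BALANCED:
            return now - stored_at <= self._max_age
        # MORE_OFTEN: reuse the data only until the user reloads the page.
        return not page_reload
```

Passing a fake `clock` makes the Balanced rule easy to exercise without waiting for real minutes to pass.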
(2) QlikTech
In patent US8244741 B2, which describes a two-step caching
architecture, QlikTech states
in the abstract:
“A method for retrieving
calculation results, wherein a first input or selection causes a first
calculation on a database to produce an intermediate result, and a second
selection or input causes a second calculation on the intermediate result,
producing a final result. These results are cached with digital fingerprint
identifiers.
A first identifier is calculated
from the first selection, and a second identifier is calculated from the
second selection and the intermediate result. The first identifier and intermediate
result are associated and cached, while the second identifier and final
result are associated and cached.
The final result may be then
retrieved using the first and second selections or inputs by recalculating
the first identifier and searching the cache for the first identifier
associated with the intermediate result. Upon locating the intermediate
result, the second identifier may be recalculated to locate the cached second
identifier associated with the final result.”
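The two-step scheme in the abstract can be sketched in Python as follows. This is a simplified reading, not QlikTech's implementation: the `fingerprint` helper, the `TwoStepCache` class, the choice of SHA-256 over a JSON encoding, and the `first_calc`/`second_calc` callbacks are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def fingerprint(*parts: Any) -> str:
    """Digital fingerprint of the given parts (SHA-256 is an assumed choice)."""
    payload = json.dumps(parts, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class TwoStepCache:
    """Caches an intermediate result and a final result under two identifiers."""

    def __init__(self, first_calc: Callable[[Any], Any],
                 second_calc: Callable[[Any, Any], Any]) -> None:
        self._first_calc = first_calc    # selection -> intermediate result
        self._second_calc = second_calc  # (intermediate, selection) -> final result
        self._cache: Dict[str, Any] = {}

    def result(self, first_selection: Any, second_selection: Any) -> Any:
        # First identifier: computed from the first selection only.
        id1 = fingerprint("step1", first_selection)
        if id1 not in self._cache:
            self._cache[id1] = self._first_calc(first_selection)
        intermediate = self._cache[id1]

        # Second identifier: computed from the second selection
        # together with the intermediate result.
        id2 = fingerprint("step2", second_selection, intermediate)
        if id2 not in self._cache:
            self._cache[id2] = self._second_calc(intermediate, second_selection)
        return self._cache[id2]
```

Because the second identifier depends on the intermediate result rather than on the first selection, two different first selections that happen to produce the same intermediate result would share the cached final result.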
To sum up, as we move from the age of traditional
analysts to the age of data scientists, the tool makers have kept innovating alongside. Visualization is one space that has, fortunately, been beating the tide. As John
Sviokla commented in Harvard Business Review, “So, the good news is that even in a world of
information surplus, we can draw upon deep human habits on how to visualize
information to make sense of a dynamic reality.”
------------------------------------------
All images are taken from the Tableau and QlikView websites and patents.