Caching mechanisms in data visualization software

When Tableau Software (NYSE: DATA), the golden boy of data visualization, made its stock-exchange debut, it ended day one at a whopping market capitalization of $2.9 billion. This underscored once more hadoopsphere's prediction that visualization technologies would hold the fore in the year 2013.

One of the key features of visualization software like Tableau's has been its caching technology. Caching is used to improve the overall performance and user experience of Big Data systems. Most visualization suites store the selection states of queries in memory. When the user makes the same set of 'selections', the cache is leveraged to:
- improve the response time
- eliminate redundant retrieval
- reduce storage requirements
- bring down network traffic
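The idea of keying a cache on the user's selection state can be sketched as follows. This is a minimal illustration, not any vendor's implementation; the class and method names are hypothetical.

```python
import hashlib
import json

class SelectionCache:
    """Maps a canonical form of the user's selections to cached query results."""

    def __init__(self):
        self._store = {}

    def _key(self, selections: dict) -> str:
        # Canonicalize the selection state so identical selections
        # always map to the same cache key, regardless of field order.
        canonical = json.dumps(selections, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get_or_compute(self, selections: dict, run_query):
        key = self._key(selections)
        if key not in self._store:
            # Cache miss: hit the database once, then reuse the result.
            self._store[key] = run_query(selections)
        return self._store[key]
```

Because the key is derived from the whole selection state, repeating the same set of selections skips the database entirely, which is what yields the response-time and network-traffic savings listed above.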

Listed below are current key architectural trends in designing caching mechanisms for visualization software:
  • Caching dashboard output in various formats like HTML, PDF, Excel, and Flash for instant retrieval.
  • Leveraging the bigger memory space on 64-bit computers.
  • Automatic caching at multiple levels, including element list, metadata object, report dataset, XML definition, document output, and database connection caching.
  • Persisting cache results to disk.
  • Sharing caches across all cluster nodes.
  • Caching data locally on mobile devices.
  • Providing admin configuration options for maximum cache size, cache wipe frequency, and automatic rebuilding of caches.
  • Securing and wiping the on-device cache to prevent data theft.

While the above may have given a good idea of what is brewing here, let us go deeper into some real details. We look at two variant caching mechanisms proposed by Tableau and QlikTech. (Please note that these may be used in a divergent manner in current versions of this software.)

(1)   Tableau –

In the published architecture, Tableau uses a data interpreter module which consists of:
- query descriptions for querying databases;
- a query cache for storing database query results;
- a pane-data-cache for storing a separate data structure for each pane in a visual table displayed by the visual interpreter module.

In one of the implementations for presenting a visual representation of a query, a determination is made as to whether the query already exists in the query cache. If it does, the result is retrieved from the query cache. If it does not, the target database is queried. "If such a database query is made, data interpreter module will formulate the query in a database-specific manner. For example, in certain instances, data interpreter module will formulate an SQL query whereas in other instances, data interpreter module will formulate an MDX query." Thereafter, the results of the query are added to the query cache.
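The check-then-query flow above, including the database-specific formulation (the patent gives SQL vs. MDX as examples), can be sketched like this. The query-builder functions and the shape of the query description are hypothetical stand-ins for illustration.

```python
def build_sql(desc):
    # Hypothetical: a trivial SELECT, for illustration only.
    return f"SELECT {', '.join(desc['fields'])} FROM {desc['table']}"

def build_mdx(desc):
    # Hypothetical MDX skeleton, for illustration only.
    return f"SELECT {{{', '.join(desc['fields'])}}} ON COLUMNS FROM [{desc['table']}]"

def fetch(desc, query_cache, execute):
    """Return results for a query description, consulting the query cache first."""
    key = (desc["dialect"], desc["table"], tuple(desc["fields"]))
    if key in query_cache:
        return query_cache[key]            # hit: reuse the stored result
    # Miss: formulate the query in a database-specific manner and run it.
    text = build_mdx(desc) if desc["dialect"] == "mdx" else build_sql(desc)
    result = execute(text)
    query_cache[key] = result              # add the result for later reuse
    return result
```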

The data retrieved in the processing steps above can contain data for a set of panes. In that case, the fetched data is partitioned into a separate data structure for each pane using a grouping transform that is conceptually the same as a "GROUP BY" in SQL, except that a separate data structure is created for each group rather than performing aggregation. Each output data structure from group-tsf is added to the pane-data-cache for later use by the visual interpreter module.
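The grouping transform described above can be sketched as follows: like a GROUP BY, it partitions rows by key, but each group keeps its raw tuples instead of collapsing into an aggregate. The function and field names are illustrative, not from the patent.

```python
from collections import defaultdict

def group_transform(rows, pane_key_fields):
    """Partition query rows into one data structure per pane.

    Conceptually like SQL GROUP BY, except that each group keeps its
    raw tuples rather than being reduced to a single aggregate value.
    """
    panes = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in pane_key_fields)
        panes[key].append(row)
    # Each entry could then be added to the pane-data-cache, keyed by pane.
    return dict(panes)
```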

In the visual interpreter module, the pane graphic is created using the described specification. Primitive objects, such as the bars in a bar chart, and their encoding objects for visual properties are created. Thereafter, the per-pane transform describing the tuples' display order is applied. The data for the pane is retrieved from the pane-data-cache using p-lookup. The data (which may be a subset of the tuples retrieved by the query) is thus bound to a pane.

Source: above text derived from patent US8140586 B2

In current implementations of Tableau, you may select the cache refresh frequency from the following options:
- Refresh Less Often: Data is cached and reused whenever it is available, regardless of when it was added to the cache. This option minimizes the number of queries sent to the database. Select it when data is not changing frequently. Refreshing less often may improve performance.
- Balanced: Data is removed from the cache after a specified number of minutes. If the data was added to the cache within the specified time range, the cached data is used; otherwise, new data is queried from the database.
- Refresh More Often: The database is queried each time the page is loaded. The data is still cached and reused until the user reloads the page. This option ensures users see the most up-to-date data; however, it may decrease performance.
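The "Balanced" option is essentially a time-to-live (TTL) cache. A minimal sketch of that policy, with hypothetical names and an injectable clock for clarity, might look like this:

```python
import time

class TTLCache:
    """Time-bounded cache, in the spirit of the 'Balanced' refresh option:
    entries expire after ttl_seconds and are re-queried on the next access."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, inserted_at)

    def get_or_compute(self, key, compute, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]           # still within the time range: reuse
        value = compute()             # expired or missing: query fresh data
        self._store[key] = (value, now)
        return value
```

Setting the TTL very high approximates "Refresh Less Often", while a TTL of zero degenerates into "Refresh More Often".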

(2)   QlikTech

In a published architecture in patent US8244741 B2, which utilizes a unique two-step caching architecture, QlikTech states in the abstract:

A method for retrieving calculation results, wherein a first input or selection causes a first calculation on a database to produce an intermediate result, and a second selection or input causes a second calculation on the intermediate result, producing a final result. These results are cached with digital fingerprint identifiers.

A first identifier is calculated from the first selection, and a second identifier is calculated from the second selection and the intermediate result. The first identifier and intermediate result are associated and cached, while the second identifier and final result are associated and cached.

The final result may be then retrieved using the first and second selections or inputs by recalculating the first identifier and searching the cache for the first identifier associated with the intermediate result. Upon locating the intermediate result, the second identifier may be recalculated to locate the cached second identifier associated with the final result.
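The two-step scheme from the abstract can be sketched as follows, assuming a hash digest as the "digital fingerprint identifier". The function names and the choice of SHA-256 are assumptions for illustration, not details from the patent.

```python
import hashlib
import json

def fingerprint(*parts) -> str:
    """Digital-fingerprint identifier over the given parts (illustrative)."""
    blob = json.dumps(parts, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()

def two_step_lookup(cache, first_sel, second_sel, calc1, calc2):
    """Cache the intermediate result under a fingerprint of the first
    selection, and the final result under a fingerprint of the second
    selection plus the intermediate result."""
    id1 = fingerprint(first_sel)
    intermediate = cache.get(id1)
    if intermediate is None:
        intermediate = calc1(first_sel)          # first calculation on the database
        cache[id1] = intermediate
    id2 = fingerprint(second_sel, intermediate)
    final = cache.get(id2)
    if final is None:
        final = calc2(second_sel, intermediate)  # second calculation
        cache[id2] = final
    return final
```

The payoff of the two-step design is visible here: a new second selection over the same first selection reuses the cached intermediate result instead of recomputing it from the database.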

To sum up, we observe that as we move from the age of traditional analysts to the age of data scientists, the tool makers have alongside been making constant innovations. Visualization is one space which has fortunately been riding the tide. As John Sviokla commented in Harvard Business Review, "So, the good news is that even in a world of information surplus, we can draw upon deep human habits on how to visualize information to make sense of a dynamic reality."

