Caching mechanisms in data visualization software

When Tableau Software (NYSE: DATA), the golden boy of data visualization, made its debut on the stock exchange, it ended day 1 at a whopping market capitalization of $2.9 billion. This underscored once more hadoopsphere’s prediction that visualization technologies will come to the fore in the year 2013.

One of the key features of visualization software like Tableau’s has been its caching technology. Caching is used to improve the overall performance and user experience of Big Data systems. Most visualization suites store the selection states of queries in memory. When the user makes the same set of ‘selections’ again, the cache is leveraged to:
- improve the response time
- eliminate redundant retrieval
- reduce storage requirement
- bring down network traffic

Listed below are the current key architectural trends in designing caching mechanisms for visualization software:
  • Caching dashboard output in various formats like HTML, PDF, Excel, and Flash for instant retrieval.
  • Leveraging the bigger memory space on 64-bit computers.
  • Caching automatically at multiple levels, including element list, metadata object, report dataset, XML definition, document output, and database connection caching.
  • Persisting the cache results to disk.
  • Sharing caches across all cluster nodes.
  • Caching data locally on mobile devices.
  • Providing admin configuration options for maximum cache size, cache wipe frequency, and automatic rebuilding of caches.
  • Securing and wiping the cache on the device to prevent data theft.
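
To make the admin options in the list above more concrete, here is a minimal sketch of a cache with a configurable maximum size and an explicit wipe. The class name `BoundedCache`, the `max_entries` parameter, and the least-recently-used eviction rule are all assumptions made for illustration; none of the products discussed here is known to work this way.

```python
from collections import OrderedDict


class BoundedCache:
    """Hypothetical size-limited cache; evicts the least recently used entry."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        # Mark the entry as recently used so it is evicted last.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Enforce the configured maximum size (LRU eviction is an assumption).
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def wipe(self):
        """Drop everything, e.g. on a scheduled wipe or when a device is lost."""
        self._entries.clear()

    def __len__(self):
        return len(self._entries)


cache = BoundedCache(max_entries=2)
cache.put("dashboard-a", "<html>A</html>")
cache.put("dashboard-b", "<html>B</html>")
cache.get("dashboard-a")                    # touch A so that B becomes the oldest
cache.put("dashboard-c", "<html>C</html>")  # B is evicted here
```

A wipe frequency could be layered on top by calling `wipe()` from a scheduler; that part is left out to keep the sketch short.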

While the above gives a good idea of what’s brewing in this space, we now go further into some real details. We look at two different caching mechanisms proposed by Tableau and QlikTech. (Please note that current versions of these products may implement caching differently.)

(1)   Tableau

In the published architecture, Tableau uses a data interpreter module which consists of:
- query descriptions for querying databases;
- query cache for storing database query results;
- pane-data-cache for storing a separate data structure for each pane in a visual table that is displayed by the visual interpreter module.

In one of the implementations for presenting a visual representation of a query, a determination is made as to whether the query already exists in the query cache. If it does, the result is retrieved from the query cache. If it does not, the target database is queried. "If such a database query is made, data interpreter module will formulate the query in a database-specific manner. For example, in certain instances, data interpreter module will formulate an SQL query whereas in other instances, data interpreter module will formulate an MDX query." Thereafter, the results of the query are added to the query cache.
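
The check-then-query flow described above can be sketched roughly as follows. This is only an illustration of the general pattern, not Tableau's code: the `QueryCache` class, the `fake_database` function, and the choice of the query text as the cache key are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple  # one result tuple


class QueryCache:
    """Hypothetical query cache keyed by the query text."""

    def __init__(self, run_query: Callable[[str], List[Row]]):
        self._run_query = run_query          # talks to the target database
        self._results: Dict[str, List[Row]] = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, query: str) -> List[Row]:
        # Step 1: does the query already exist in the cache?
        if query in self._results:
            self.hits += 1
            return self._results[query]
        # Step 2: otherwise query the target database...
        self.misses += 1
        rows = self._run_query(query)
        # Step 3: ...and add the results to the cache for next time.
        self._results[query] = rows
        return rows


def fake_database(query: str) -> List[Row]:
    """Stand-in for a real SQL or MDX backend (assumption for the demo)."""
    return [("East", 2012, 100), ("West", 2012, 80)]


cache = QueryCache(fake_database)
first = cache.fetch("SELECT region, year, SUM(sales) FROM orders GROUP BY 1, 2")
second = cache.fetch("SELECT region, year, SUM(sales) FROM orders GROUP BY 1, 2")
```

In this sketch the second call never reaches `fake_database`, which is the redundant retrieval a query cache is meant to eliminate.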

The data retrieved in the processing steps above can contain data for a set of panes. When this is the case, the data that is fetched above is partitioned into a separate data structure for each pane using a grouping transform that is conceptually the same as a "GROUP BY" in SQL except separate data structures are created for each group rather than performing aggregation. Each output data structure from group-tsf is added to pane-data-cache for later use by visual interpreter module.
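
The grouping transform can be pictured with a small sketch. It partitions rows the way a "GROUP BY" would pick its groups, but keeps every row instead of aggregating. The function name `group_transform`, the `pane_key` parameter, and the sample rows are assumptions for illustration; the patent does not specify them.

```python
from collections import defaultdict


def group_transform(rows, pane_key):
    """Split rows into one list per pane; no aggregation is performed."""
    panes = defaultdict(list)
    for row in rows:
        panes[pane_key(row)].append(row)
    return dict(panes)


# Hypothetical query result: (region, year, sales)
rows = [
    ("East", 2012, 100),
    ("East", 2013, 120),
    ("West", 2012, 80),
]

# One pane per region; each pane keeps its full tuples for later rendering.
pane_data_cache = group_transform(rows, pane_key=lambda row: row[0])
```

Here `pane_data_cache["East"]` holds both East tuples unchanged, whereas a SQL "GROUP BY" on region would have collapsed them into a single aggregated row.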

In the visual interpreter module, the pane graphic is created using a described specification. Primitive objects like bars in a bar chart and their encoding objects for visual properties are created. Thereafter, the per-pane transform that describes the display order of tuples is applied. The data for a pane is retrieved from pane-data-cache using p-lookup. The data (which may be a subset of the tuples retrieved from the query) is thus bound to a pane.

Source: above text derived from patent US8140586 B2

In the current implementations of Tableau, you may select cache refresh frequency from any of the following options:
  • Refresh Less Often: Data is cached and reused whenever it is available regardless of when it was added to the cache. This option minimizes the number of queries sent to the database. Select this option when data is not changing frequently. Refreshing less often may improve performance.
  • Balanced: Data is removed from the cache after a specified number of minutes. If the data has been added to the cache within the specified time range the cached data will be used, otherwise new data will be queried from the database.
  • Refresh More Often: The database is queried each time the page is loaded. The data is still cached and will be reused until the user reloads the page. This option will ensure users see the most up to date data; however, it may decrease performance.
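
The three options above can be modeled as freshness rules on a single cache. The sketch below is only a way to reason about the trade-off, not Tableau's implementation: the class `RefreshPolicyCache`, the policy strings, the `page_load` flag, and the injectable `clock` are all hypothetical.

```python
import time


class RefreshPolicyCache:
    """Hypothetical cache wrapper; policy names mirror the three options above."""

    POLICIES = ("less_often", "balanced", "more_often")

    def __init__(self, loader, policy="balanced", max_age_seconds=600.0,
                 clock=time.monotonic):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self._loader = loader            # queries the database on a miss
        self._policy = policy
        self._max_age = max_age_seconds  # only used by "balanced"
        self._clock = clock
        self._entries = {}               # key -> (value, time stored)

    def get(self, key, page_load=False):
        entry = self._entries.get(key)
        if entry is not None and self._is_fresh(entry[1], page_load):
            return entry[0]
        value = self._loader(key)
        self._entries[key] = (value, self._clock())
        return value

    def _is_fresh(self, stored_at, page_load):
        if self._policy == "less_often":
            return True   # reuse regardless of when the entry was added
        if self._policy == "balanced":
            return self._clock() - stored_at <= self._max_age
        # "more_often": requery on every page load, reuse within the page
        return not page_load
```

Passing the clock in as a parameter is a design choice made here so the expiry rule can be exercised without waiting for real time to pass.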

(2)   QlikTech

In patent US8244741 B2, which describes a unique two-step caching architecture, QlikTech states in the abstract:

A method for retrieving calculation results, wherein a first input or selection causes a first calculation on a database to produce an intermediate result, and a second selection or input causes a second calculation on the intermediate result, producing a final result. These results are cached with digital fingerprint identifiers.

A first identifier is calculated from the first selection, and a second identifier is calculated from the second selection and the intermediate result. The first identifier and intermediate result are associated and cached, while the second identifier and final result are associated and cached.

The final result may then be retrieved using the first and second selections or inputs by recalculating the first identifier and searching the cache for the first identifier associated with the intermediate result. Upon locating the intermediate result, the second identifier may be recalculated to locate the cached second identifier associated with the final result.
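
A rough sketch of this two-step scheme is given below. It follows the structure of the abstract (a first identifier from the first selection, a second identifier from the second selection plus the intermediate result), but the details are assumptions: SHA-256 over a JSON encoding stands in for the "digital fingerprint", and the class, function, and sample data names are hypothetical.

```python
import hashlib
import json


def fingerprint(*parts):
    """Digital fingerprint of the inputs (SHA-256 is an assumption)."""
    payload = json.dumps(parts, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TwoStepCache:
    """Hypothetical two-level cache keyed by fingerprints."""

    def __init__(self, first_calc, second_calc):
        self._first_calc = first_calc      # selection -> intermediate result
        self._second_calc = second_calc    # (intermediate, selection) -> final
        self._store = {}
        self.computed = []                 # log of calculations actually run

    def result(self, first_selection, second_selection):
        # First identifier: derived from the first selection only.
        id1 = fingerprint("step1", first_selection)
        if id1 not in self._store:
            self._store[id1] = self._first_calc(first_selection)
            self.computed.append("first")
        intermediate = self._store[id1]

        # Second identifier: second selection plus the intermediate result.
        id2 = fingerprint("step2", second_selection, intermediate)
        if id2 not in self._store:
            self._store[id2] = self._second_calc(intermediate, second_selection)
            self.computed.append("second")
        return self._store[id2]


SALES = [("East", 2012, 100), ("East", 2013, 120), ("West", 2012, 80)]


def filter_region(region):
    return [row for row in SALES if row[0] == region]


def total_for_year(rows, year):
    return sum(amount for _, row_year, amount in rows if row_year == year)


cache = TwoStepCache(filter_region, total_for_year)
```

With this setup, asking for `("East", 2012)` and then `("East", 2013)` reruns only the second calculation, because the intermediate result for "East" is found through the first identifier.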

To sum up, we observe that as we move from the age of traditional analysts to the age of data scientists, the tool makers have been innovating constantly alongside. Visualization is one space which fortunately has been beating the tide. As John Sviokla commented in Harvard Business Review, “So, the good news is that even in a world of information surplus, we can draw upon deep human habits on how to visualize information to make sense of a dynamic reality.”

All images are taken from the Tableau and QlikView websites and patents.

