
Caching mechanisms in data visualization software

When Tableau Software (NYSE: DATA), the golden boy of data visualization, made its stock exchange debut, it ended day one at a whopping market capitalization of $2.9 billion. This reinforced hadoopsphere’s prediction that visualization technologies would hold the fore in 2013.

One of the key features of visualization software like Tableau’s has been its caching technology. Caching is used to improve the overall performance and user experience of Big Data systems. Most visualization suites store the selection states of queries in memory. When the user repeats the same set of ‘selections’, the cache is leveraged to:
- improve response time
- eliminate redundant retrieval
- reduce storage requirements
- bring down network traffic
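As a minimal sketch of the idea (all names here are hypothetical, not any vendor's actual API), a selection-state cache can key results on the normalized set of user selections, so repeating the same selections in any order skips the database round trip:

```python
# Minimal sketch of a selection-state cache (illustrative, not vendor code).
# The key is an order-independent frozenset of the user's selections, so the
# same selections made in any order map to the same cached result.

class SelectionCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def _key(self, selections):
        # Normalize: {"year": 2013, "region": "EMEA"} always yields the
        # same key regardless of the order selections were made in.
        return frozenset(selections.items())

    def get_or_query(self, selections, run_query):
        key = self._key(selections)
        if key in self._cache:
            self.hits += 1                 # faster response, no retrieval
        else:
            self.misses += 1
            self._cache[key] = run_query(selections)   # single retrieval
        return self._cache[key]


cache = SelectionCache()
result1 = cache.get_or_query({"year": 2013, "region": "EMEA"},
                             lambda s: ["row1", "row2"])
# Same selections, different order: served from cache.
result2 = cache.get_or_query({"region": "EMEA", "year": 2013},
                             lambda s: ["row1", "row2"])
```

The second call never reaches the database, which is precisely how the cache improves response time and brings down network traffic.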

Listed below are current key architectural trends in designing caching mechanisms for visualization software:
  • Caching dashboard output in various formats (HTML, PDF, Excel, Flash) for instant retrieval.
  • Leveraging the larger memory space available on 64-bit machines.
  • Caching automatically at multiple levels, including element list, metadata object, report dataset, XML definition, document output, and database connection.
  • Persisting cache results to disk.
  • Sharing caches across all cluster nodes.
  • Caching data locally on mobile devices.
  • Providing admin configuration options for maximum cache size, cache wipe frequency, and automatic rebuilding of new caches.
  • Securing and wiping the on-device cache to prevent data theft.

While the above may have given a good idea of what’s brewing here, let us dig further into some real details. We look at two variant caching mechanisms proposed by Tableau and QlikTech. (Please note that these may be used in a divergent manner in current versions of these software suites.)

(1)   Tableau –

In the published architecture, Tableau uses a data interpreter module which consists of:
- query descriptions for querying databases;
- query cache for storing database query results;
- pane-data-cache for storing separate data structure for each pane in a visual table that is displayed by visual interpreter module.

In one implementation for presenting a visual representation of a query, a determination is first made as to whether the query already exists in the query cache. If it does, the result is retrieved from the query cache. If it does not, the target database is queried. "If such a database query is made, data interpreter module will formulate the query in a database-specific manner. For example, in certain instances, data interpreter module will formulate an SQL query whereas in other instances, data interpreter module will formulate an MDX query." Thereafter, the results of the query are added to the query cache.
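The cache-or-query flow above can be sketched as follows (class and method names are illustrative, not Tableau's actual implementation). The query description serves as the cache key; on a miss the query is formulated in a database-specific dialect and the result is added to the query cache:

```python
# Sketch of the data interpreter's cache-or-query flow (illustrative names).
# On a cache miss, the query is formulated per-database (SQL vs. MDX per the
# patent text), executed, and its result stored in the query cache.

class DataInterpreter:
    def __init__(self, dialect="sql"):
        self.dialect = dialect          # "sql" for relational, "mdx" for cubes
        self.query_cache = {}

    def formulate(self, description):
        # Database-specific query formulation.
        if self.dialect == "mdx":
            return f"SELECT ... ON COLUMNS FROM [{description}]"
        return f"SELECT * FROM {description}"

    def execute(self, description, run):
        if description in self.query_cache:
            return self.query_cache[description]    # cache hit
        statement = self.formulate(description)
        result = run(statement)                     # hit the database
        self.query_cache[description] = result      # populate the cache
        return result


interp = DataInterpreter()
r1 = interp.execute("sales", lambda stmt: [("east", 10), ("west", 20)])
r2 = interp.execute("sales", lambda stmt: [("east", 10), ("west", 20)])
```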

The data retrieved in the processing steps above can contain data for a set of panes. When this is the case, the fetched data is partitioned into a separate data structure for each pane using a grouping transform that is conceptually the same as a "GROUP BY" in SQL, except that separate data structures are created for each group rather than performing aggregation. Each output data structure from group-tsf is added to the pane-data-cache for later use by the visual interpreter module.
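A minimal sketch of such a grouping transform (hypothetical function names, assuming rows arrive as dictionaries) makes the contrast with SQL's GROUP BY concrete: rows are partitioned by pane key, but each group keeps its raw tuples instead of being collapsed into an aggregate:

```python
# Sketch of the grouping transform: like SQL GROUP BY, but instead of
# aggregating each group, it emits a separate data structure per group
# (one per pane), which populates the pane-data-cache.

from collections import defaultdict

def group_transform(rows, pane_key):
    """Partition rows into one list per pane, without aggregating."""
    panes = defaultdict(list)
    for row in rows:
        panes[row[pane_key]].append(row)
    return dict(panes)


rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 20},
    {"region": "east", "sales": 5},
]
pane_data_cache = group_transform(rows, "region")
```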

In the visual interpreter module, the pane graphic is created using the described specification. Primitive objects, such as bars in a bar chart, and their encoding objects for visual properties are created. Thereafter, the per-pane transform describing the tuples' display order is applied. The data for a pane is retrieved from the pane-data-cache using p-lookup. The data (which may be a subset of the tuples retrieved by the query) is thus bound to a pane.
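The pane-binding step might look like the following sketch (names like `p_lookup` are taken from the patent's description; the rest is illustrative): fetch a pane's tuples from the pane-data-cache, then apply a per-pane sort to fix the display order of the primitives:

```python
# Sketch of the visual interpreter's pane binding (illustrative, assuming
# the pane-data-cache built by the grouping transform). p_lookup fetches a
# pane's tuples; a per-pane transform then describes their display order.

pane_data_cache = {
    "east": [{"product": "B", "sales": 5}, {"product": "A", "sales": 10}],
}

def p_lookup(cache, pane_key):
    return cache.get(pane_key, [])

def bind_pane(cache, pane_key, order_by):
    tuples = p_lookup(cache, pane_key)
    # Per-pane transform: sort tuples into their display order.
    return sorted(tuples, key=lambda t: t[order_by])


bars = bind_pane(pane_data_cache, "east", "product")
```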

Source: above text derived from patent US8140586 B2

In current implementations of Tableau, you may select the cache refresh frequency from any of the following options:
  • Refresh Less Often: Data is cached and reused whenever it is available, regardless of when it was added to the cache. This option minimizes the number of queries sent to the database. Select it when data is not changing frequently. Refreshing less often may improve performance.
  • Balanced: Data is removed from the cache after a specified number of minutes. If the data was added to the cache within the specified time range, the cached data is used; otherwise, new data is queried from the database.
  • Refresh More Often: The database is queried each time the page is loaded. The data is still cached and will be reused until the user reloads the page. This option ensures users see the most up-to-date data; however, it may decrease performance.
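Conceptually, all three options reduce to a time-to-live (TTL) policy on cache entries. A rough sketch (illustrative only, not Tableau's implementation): "Refresh Less Often" is an unbounded TTL, "Balanced" is a fixed TTL in minutes, and "Refresh More Often" expires entries on every page load:

```python
# Sketch of the three refresh policies as a TTL cache (illustrative only).
# ttl_seconds=None models "Refresh Less Often" (never expire); a fixed TTL
# models "Balanced"; expiring per page load models "Refresh More Often".

import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds          # None means never expire
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, added_at = entry
        if self.ttl is not None and time.time() - added_at > self.ttl:
            del self._store[key]        # expired: force a fresh query
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.time())


balanced = TTLCache(ttl_seconds=30 * 60)   # "Balanced": 30-minute window
balanced.put("dashboard:sales", [1, 2, 3])
```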

(2)   QlikTech

In the architecture published in patent US8244741 B2, which utilizes a unique two-step caching mechanism, QlikTech states in the abstract:

A method for retrieving calculation results, wherein a first input or selection causes a first calculation on a database to produce an intermediate result, and a second selection or input causes a second calculation on the intermediate result, producing a final result. These results are cached with digital fingerprint identifiers.

A first identifier is calculated from the first selection, and a second identifier is calculated from the second selection and the intermediate result. The first identifier and intermediate result are associated and cached, while the second identifier and final result are associated and cached.

The final result may then be retrieved using the first and second selections or inputs by recalculating the first identifier and searching the cache for the first identifier associated with the intermediate result. Upon locating the intermediate result, the second identifier may be recalculated to locate the cached second identifier associated with the final result.
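The two-step scheme can be sketched as follows (illustrative naming, not QlikTech's implementation). The first fingerprint is a digest of the first selection alone; the second is a digest of the second selection together with the intermediate result it operates on, so the final result is reachable only through a valid intermediate:

```python
# Sketch of the two-step fingerprint cache from the abstract (illustrative).
# id1 fingerprints the first selection; id2 fingerprints the second
# selection plus the intermediate result. Both results are cached under
# their digital fingerprint identifiers.

import hashlib
import json

cache = {}

def fingerprint(*parts):
    payload = json.dumps(parts, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def two_step(selection1, calc1, selection2, calc2):
    id1 = fingerprint(selection1)
    if id1 not in cache:
        cache[id1] = calc1(selection1)              # first calculation
    intermediate = cache[id1]

    id2 = fingerprint(selection2, intermediate)     # depends on intermediate
    if id2 not in cache:
        cache[id2] = calc2(intermediate, selection2)  # second calculation
    return cache[id2]


# First call computes both steps; repeating the same selections is answered
# entirely from the cache by recomputing the two fingerprints.
final = two_step({"region": "east"}, lambda s: [10, 5],
                 {"metric": "sum"}, lambda inter, s: sum(inter))
```

Note how a changed first selection invalidates the second step automatically: a different intermediate result yields a different second fingerprint, so a stale final result can never be returned.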

To sum up, we observe that as we move from the age of traditional analysts to the age of data scientists, the tool makers have been innovating constantly alongside. Visualization is one space which, fortunately, has been beating the tide. As John Sviokla commented in Harvard Business Review, “So, the good news is that even in a world of information surplus, we can draw upon deep human habits on how to visualize information to make sense of a dynamic reality.”

All images taken from the Tableau and QlikView websites and patents.

