Caching mechanisms in data visualization software


When Tableau Software (NYSE: DATA), the golden boy of data visualization, made its debut on the stock exchange, it ended day 1 at a whopping market capitalization of $2.9 billion. This reinforced hadoopsphere’s prediction that visualization technologies would come to the fore in 2013.

One of the key features of visualization software like Tableau’s has been its caching technology. Caching is used to improve the overall performance and user experience of Big Data systems. Most visualization suites store the selection states of queries in memory. When the user makes the same set of ‘selections’ again, the cache is leveraged to:
- improve the response time
- eliminate redundant retrieval
- reduce storage requirements
- bring down network traffic

Listed below are current key architectural trends in designing caching mechanisms for visualization software:
  • Caching dashboard output in various formats like HTML, PDF, Excel, and Flash for instant retrieval.
  • Leveraging the bigger memory space on 64-bit computers.
  • Caching automatically at multiple levels, including element list, metadata object, report dataset, XML definition, document output, and database connection caching.
  • Persisting the cache results to disk.
  • Sharing caches across all cluster nodes.
  • Caching data locally on mobile devices.
  • Providing admin configuration options for maximum cache size, cache wipe frequency, and automatic rebuilding of caches.
  • Securing and wiping the cache on the device to prevent data theft.


While the above may have given a good idea of what’s brewing here, we now go further into some real details. We look at two different caching mechanisms proposed by Tableau and QlikTech. (Please note that current versions of these products may implement them differently.)

(1)   Tableau –

In the published architecture, Tableau uses a data interpreter module which consists of:
- query descriptions for querying databases;
- query cache for storing database query results;
- pane-data-cache for storing separate data structure for each pane in a visual table that is displayed by visual interpreter module.

In one of the implementations for presenting a visual representation of a query, a determination is made whether the query already exists in the query cache. If it does, the result is retrieved from the query cache. If it does not, the target database is queried. "If such a database query is made, data interpreter module will formulate the query in a database-specific manner. For example, in certain instances, data interpreter module will formulate an SQL query whereas in other instances, data interpreter module will formulate an MDX query." Thereafter, the results of the query are added to the query cache.

The data retrieved in the processing steps above can contain data for a set of panes. When this is the case, the data that is fetched above is partitioned into a separate data structure for each pane using a grouping transform that is conceptually the same as a "GROUP BY" in SQL except separate data structures are created for each group rather than performing aggregation. Each output data structure from group-tsf is added to pane-data-cache for later use by visual interpreter module.
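The flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Tableau's implementation: the class name `DataInterpreter`, the `run_query` callable, the `db_hits` counter and the sample rows are all hypothetical, and the cache key is assumed to be the query text itself.

```python
from collections import defaultdict


class DataInterpreter:
    """Toy stand-in for the data interpreter module (all names are hypothetical)."""

    def __init__(self, run_query):
        self.run_query = run_query   # assumed callable: query text -> list of row dicts
        self.query_cache = {}        # query text -> rows returned by the database
        self.pane_data_cache = {}    # pane key -> rows belonging to that pane
        self.db_hits = 0             # counts real database round trips

    def fetch(self, query):
        # Query the database only when the query is not already in the query cache.
        if query not in self.query_cache:
            self.db_hits += 1
            self.query_cache[query] = self.run_query(query)
        return self.query_cache[query]

    def partition_into_panes(self, rows, pane_fields):
        # Grouping transform: like GROUP BY, but every group keeps its rows
        # as a separate data structure instead of being aggregated.
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[f] for f in pane_fields)].append(row)
        self.pane_data_cache.update(groups)
        return groups

    def p_lookup(self, pane_key):
        # Lookup used later, when a pane is bound to its data.
        return self.pane_data_cache.get(pane_key, [])


def fake_db(query):
    # Stand-in for a real SQL or MDX backend.
    return [
        {"region": "East", "year": 2012, "sales": 10},
        {"region": "East", "year": 2013, "sales": 12},
        {"region": "West", "year": 2012, "sales": 7},
    ]


interp = DataInterpreter(fake_db)
rows = interp.fetch("SELECT region, year, sales FROM orders")
interp.fetch("SELECT region, year, sales FROM orders")  # second call is served from the cache
interp.partition_into_panes(rows, ["region"])
east = interp.p_lookup(("East",))
```

Here the second `fetch` never reaches the database, and each pane (one per region in this toy example) gets its own list of tuples.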

In the visual interpreter module, the pane graphic is created using a described specification. Primitive objects, like the bars in a bar chart, and the encoding objects for their visual properties are created. Thereafter, the per-pane transform that describes the display order of tuples is applied. The data for a pane is retrieved from pane-data-cache using p-lookup. The data (which may be a subset of the tuples retrieved by the query) is thus bound to a pane.

Source: above text derived from patent US8140586 B2

In the current implementations of Tableau, you may select cache refresh frequency from any of the following options:
- Refresh Less Often: Data is cached and reused whenever it is available regardless of when it was added to the cache. This option minimizes the number of queries sent to the database. Select this option when data is not changing frequently. Refreshing less often may improve performance.
- Balanced: Data is removed from the cache after a specified number of minutes. If the data has been added to the cache within the specified time range the cached data will be used, otherwise new data will be queried from the database.
- Refresh More Often: The database is queried each time the page is loaded. The data is still cached and will be reused until the user reloads the page. This option will ensure users see the most up to date data; however, it may decrease performance.
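To make the difference between the three options concrete, here is a minimal sketch of a cache that supports all of them. It is not Tableau's code: the class `RefreshPolicyCache`, the policy strings, the `page_reload` flag and the use of seconds instead of minutes are assumptions made for illustration.

```python
import time


class RefreshPolicyCache:
    """Toy cache with three refresh policies (names are hypothetical)."""

    def __init__(self, load, policy="balanced", max_age_seconds=600, clock=time.monotonic):
        self.load = load                 # assumed callable: key -> fresh value from the database
        self.policy = policy             # "less_often", "balanced" or "more_often"
        self.max_age = max_age_seconds   # only used by the "balanced" policy
        self.clock = clock               # injectable so the example can fake time
        self.entries = {}                # key -> (time stored, value)

    def get(self, key, page_reload=False):
        entry = self.entries.get(key)
        if entry is not None and self._is_fresh(entry[0], page_reload):
            return entry[1]
        value = self.load(key)
        self.entries[key] = (self.clock(), value)
        return value

    def _is_fresh(self, stored_at, page_reload):
        if self.policy == "less_often":
            return True                                       # reuse regardless of age
        if self.policy == "balanced":
            return self.clock() - stored_at <= self.max_age   # reuse within the time range
        if self.policy == "more_often":
            return not page_reload                            # reuse until the page is reloaded
        raise ValueError(f"unknown policy: {self.policy}")


now = [0.0]   # fake clock, advanced by hand
calls = []    # records every trip to the "database"


def load(key):
    calls.append(key)
    return f"{key}@{now[0]}"


cache = RefreshPolicyCache(load, policy="balanced", max_age_seconds=60, clock=lambda: now[0])
first = cache.get("sales")
now[0] = 30.0
second = cache.get("sales")   # still inside the 60 second window, so cached
now[0] = 120.0
third = cache.get("sales")    # too old, so the loader is called again
```

Injecting the clock keeps the example deterministic; a real deployment would simply rely on the default.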


(2)   QlikTech

In patent US8244741 B2, which describes a two-step caching architecture, QlikTech states in the abstract:

A method for retrieving calculation results, wherein a first input or selection causes a first calculation on a database to produce an intermediate result, and a second selection or input causes a second calculation on the intermediate result, producing a final result. These results are cached with digital fingerprint identifiers.

A first identifier is calculated from the first selection, and a second identifier is calculated from the second selection and the intermediate result. The first identifier and intermediate result are associated and cached, while the second identifier and final result are associated and cached.

The final result may be then retrieved using the first and second selections or inputs by recalculating the first identifier and searching the cache for the first identifier associated with the intermediate result. Upon locating the intermediate result, the second identifier may be recalculated to locate the cached second identifier associated with the final result.
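A rough sketch of this two-step scheme follows. It is an interpretation of the abstract, not QlikTech's implementation: the class `TwoStepCache`, the SHA-256 based `fingerprint` helper, the JSON serialization and the sample rows are all assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(*parts):
    # Digital fingerprint of arbitrary JSON-serializable inputs (assumed scheme).
    payload = json.dumps(parts, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class TwoStepCache:
    """Caches an intermediate and a final result under fingerprint identifiers."""

    def __init__(self, first_calc, second_calc):
        self.first_calc = first_calc     # selection -> intermediate result
        self.second_calc = second_calc   # (intermediate, second selection) -> final result
        self.cache = {}                  # identifier -> cached result
        self.first_runs = 0
        self.second_runs = 0

    def result(self, first_selection, second_selection):
        # First identifier: calculated from the first selection only.
        id1 = fingerprint("step1", first_selection)
        if id1 not in self.cache:
            self.first_runs += 1
            self.cache[id1] = self.first_calc(first_selection)
        intermediate = self.cache[id1]

        # Second identifier: calculated from the second selection and the intermediate result.
        id2 = fingerprint("step2", second_selection, intermediate)
        if id2 not in self.cache:
            self.second_runs += 1
            self.cache[id2] = self.second_calc(intermediate, second_selection)
        return self.cache[id2]


ROWS = [
    {"year": 2012, "region": "East", "sales": 10},
    {"year": 2013, "region": "East", "sales": 12},
    {"year": 2013, "region": "West", "sales": 7},
]


def filter_rows(selection):
    return [r for r in ROWS if all(r[k] == v for k, v in selection.items())]


def sum_field(rows, field):
    return sum(r[field] for r in rows)


cache = TwoStepCache(filter_rows, sum_field)
total = cache.result({"year": 2013}, "sales")        # computes both steps
total_again = cache.result({"year": 2013}, "sales")  # both steps come from the cache
```

Because the intermediate result is cached under its own identifier, a different second selection on the same first selection reuses the filtered rows and only repeats the second calculation.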


To sum up, we observe that as we move from the age of traditional analysts to the age of data scientists, the tool makers have kept pace with constant innovation. Visualization is one space which, fortunately, has been staying ahead of the tide. As John Sviokla commented in Harvard Business Review, “So, the good news is that even in a world of information surplus, we can draw upon deep human habits on how to visualize information to make sense of a dynamic reality.”


------------------------------------------
all images taken from Tableau, QlikView website and patents

Anti-Patterns in Big Data convergence with M2M


It is almost an open secret by now that the next big trend in the Big Data space will be adoption in the M2M and ‘Internet of Things’ arena. For major indicators of the battle to come, one only needs to look at GE’s major investment in Pivotal and Intel’s own distribution of Hadoop.

However, as the buzz goes around, there are still some M2M (machine-to-machine) anti-patterns which are being hyped as inhibitors.
Defining an anti-pattern:
“there must be at least two key elements present …:
  • Some repeated pattern of action, process or structure that initially appears to be beneficial, but ultimately produces more bad consequences than beneficial results, and
  • An alternative solution exists that is clearly documented, proven in actual practice and repeatable.”

Let’s analyze four key notions and, assuming them to be anti-patterns, counter each with a possible solution and its trade-offs.

1)      The silos of Internet of Things

While it is true that the current focus is the ‘Internet of Things’, the reality is that we still have only an ‘Intranet of Things’. This means that the data exists only in enterprise or organization silos, and it is rare to see the devices of one organization talking to the devices of another.

  • Data Mashups – It may be a while before data co-exists in some industry-wide warehouse. Until then, data mashups can be of excellent assistance. It therefore makes sense to take the current data in silos, mash it up with other available sources, and derive business insights from the result.


2)     Trust among talking Machines

‘Trust’ is a widely debated topic in any sort of information exchange. In the case of social networks, developing ‘trust’ still involves manual intervention. For instance, on most social networks like Facebook we share data (status, pics, comments) only with a trusted group of friends. Friend requests themselves are accepted only from ‘trusted’ people. The same concept extends to machine-to-machine requests, albeit with less manual intervention and intelligence.

  • Analogous implementations – Many industries, like financial services, have successfully implemented trust-based models for their systems. Analogous beginnings for M2M may be easier to replicate and succeed with.


3)     Skill grill

Both M2M and Big Data implementations are specialized vendor implementations. Solutions, vendors, and a skilled workforce are all currently scarce.

  • Training - While it is a big talking point that Big Data and Hadoop skills are still few and far between, the same discussion extends to M2M skills. As with Hadoop, where training and community outreach have been increasing penetration, so it is for M2M.


4)     Money to money

Most projects are guided by a strong monetization factor, and ROI figures dominate proposal clearance. Currently, M2M is still at a rudimentary stage, and enterprises have yet to cash in on the investments being made.

  • Data as an asset - As a known business model, enterprises reap rich rewards from their data assets. What is of monetary value to one CSP (for instance) may be a competitive differentiator and strategic asset for another. Consequently, a blind focus on immediate returns may not yield any benefit.


While each firm may employ its own methods and strategies to deal with the M2M bubble, it is evident that Big Data technologies like Hadoop can be a key differentiator in creating the niche platform base. To that end, we will keep monitoring the space as the projects evolve to create differentiated solutions.

----------------------------------
Top image source: Navigant Research

Hadoop's 10 in LinkedIn's 10

LinkedIn, the pioneering professional social network, has turned 10 years old. One of the hallmarks of its journey has been its technical accomplishments and significant contribution to open source, particularly in the last few years. Hadoop occupies a central place in its technical environment, powering some of the most used features of its desktop and mobile apps. As LinkedIn enters the second decade of its existence, here is a look at 10 major projects and products powered by Hadoop in its data ecosystem.

1)      Voldemort:

Arguably the most famous export of LinkedIn engineering, Voldemort is a distributed key-value storage system. Named after the antagonist of the Harry Potter series and influenced by Amazon’s Dynamo, the wizardry in this database extends to its self-healing features. Available in an HA configuration, its layered, pluggable architecture is used for both read-only and read-write use cases.

2)      Azkaban:

A batch job scheduling system with a friendly UI, Azkaban aims to make batch programming easy and visually appealing. “It allows the independent pieces to be declaratively assembled into a single workflow, and for that workflow to be scheduled to run periodically...includes things like email notifications of success or failure, resource locking, retry on failure, log collection, historical job run time information, and so on.”

3)      DataFu:

DataFu is a collection of Pig UDFs (user-defined functions) for data analysis on Hadoop. As the team at LinkedIn developed and refined its UDFs for the ‘People You May Know’ and ‘Skills’ features, it compiled the well-tested functions into this library. It contains functions for tasks like PageRank, variance, bag operations and set operations.

4)      Decomposer:

Decomposer is a collection of implementations, in Java, of matrix decomposition algorithms for extremely large matrices. It currently contains a Singular Value Decomposition implementation, and the library is in the process of being ‘absorbed’ into the Apache Mahout project.

5)      Kafka:

Kafka is a distributed publish-subscribe messaging system. Although at first look it may seem similar to Apache Flume, it is actually intended to be a Message Broker equivalent. “Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines”

6)      White Elephant:

White Elephant is used to parse Hadoop logs and provide visualization dashboard for Hadoop cluster statistics, including total task time, slots used, CPU time, and failed job counts.  White Elephant’s server is a JRuby application, also deployable on Tomcat while the data is stored in HyperSQL in-memory DB and charts rendered with Rickshaw.

7)      Helix:

Helix, built on top of Apache Zookeeper, is a generic cluster management framework for automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. LinkedIn uses “Helix to manage our search-as-a-service clusters hosting multiple search applications, Databus, our data change capture component, and Espresso, our indexed, timeline-consistent, document-oriented data store”

8)      Norbert:

Norbert, implemented in Scala, wraps ZooKeeper and Netty and uses Protocol Buffers to provide easy cluster management and workload distribution. It is claimed to be capable of “quickly distribut(ing) a simple client/server architecture to create a highly scalable architecture capable of handling heavy traffic”

9)      Giraph:

Giraph, inspired by Google Pregel, uses the Bulk Synchronous Parallel (BSP) model to compute graph algorithms on Hadoop clusters. Like its social media counterparts Facebook and Twitter, LinkedIn has taken an active interest in this project and uses Giraph for social graph computations and interpretations.

10)  Avatara:

“Avatara is LinkedIn's scalable, low latency, and highly-available OLAP system for ‘sharded’ multi-dimensional queries in the time constraints of a request/response loop.” Used in “Who's Viewed My Profile?”, Avatara has an offline engine that computes cubes in batch and an online engine that serves queries in real time.

(Although Pig, Avro and Zookeeper constitute a key part of the ecosystem, they are not covered in detail here, on the assumption that they belong to the core layers of a Hadoop deployment.)

And, finally, if you are still not convinced on LinkedIn’s Hadoop story, here is a quick snapshot of Hadoop skilled employees in various social networks. The data has been sourced from LinkedIn profiles. An additional measure of employees associated with Apache Software Foundation has been included but it is likely that the number reflected in LinkedIn profiles may vary from actual.


Have you used Lua for MapReduce?


Lua, a cross-platform programming language, has been popular in games and embedded systems. However, because it works so well for configuration, it has found wider acceptance in other use cases as well.

Lua was inspired by SOL (Simple Object Language) and DEL (Data-Entry Language) and was created by Roberto Ierusalimschy, Waldemar Celes, and Luiz Henrique de Figueiredo at the Pontifical Catholic University of Rio de Janeiro, Brazil. Its name means ‘Moon’ in Portuguese, and it has found many big takers like Adobe, Nginx and Wikipedia.

Quick overview of key aspects:
- 8 basic types: nil, boolean, number, string, function, userdata, thread, and table
- 3 kinds of variables: global variables, local variables and table fields
- Control structures:
  - Conditionals: if statements
  - Iteration: while, repeat, for
- Functions: similar to functions in other languages
- Closures: functions that capture variables from their enclosing scope, for example a function returned by another function
- Co-routines: similar to threads but not exactly the same; co-routines are collaborative
- Object-oriented programming: tables in Lua are objects and have state like objects
- Garbage collection: Lua performs automatic memory management


A good explanation of Lua versus other scripting languages is given in a discussion chain on MediaWiki. Essentially, Lua has been gaining wider acceptance over the years due to its:
- Extensibility (through Lua or other languages like C)
- Small size (a few MB at most)
- Technical USPs (recursion, first-class functions, etc.)
- Efficiency (among the faster scripting languages)
- Portability (various platforms including Windows, Unix flavors, PlayStation, etc.)


A recent implementation utilizing Lua is the Kitten project by Josh Wills (Cloudera), who is also the author of Apache Crunch. Much like Crunch, which eases the task of invoking MapReduce jobs, Kitten simplifies the implementation of YARN (aka MRv2) applications by treating it as a series of patterns. Kitten is written in Java but uses Lua-based configuration files for configuring, launching, and monitoring YARN applications. The resources needed by the application are specified in these Lua configuration files.

As Josh writes in the Readme for Kitten project:
Kitten makes extensive use of Lua’s table type to organize information about how a YARN application should be executed… (Lua) has a number of desirable properties for the use case of configuring YARN applications, namely:
  1. It integrates well with both Java and C++. We expect to see YARN applications written in both languages, and expect that Kitten will need to support both. Having a single configuration format for both languages reduces the cognitive overhead for developers.
  2. It is a programming language, but not much of one. Lua provides a complete programming environment when you need it, but mainly stays out of your way and lets you focus on configuration.
  3. It tolerates missing values well. It is easy to reference values in a configuration file that may not be defined until much later. For example, we can specify parameters that will eventually contain the value of the master's hostname and port, but are undefined when the client application is initially configured.
That said, we fully expect that other languages (e.g., Lisp) would make excellent configuration languages for YARN applications…


Another significant experimental project, for MRv1, has been Rohit Joshi’s lua-mapreduce. The project, inspired by the Python project Octopy, has been used to demonstrate parallel execution of MapReduce tasks. Source code for this project can also be found on GitHub.

Please note that both these projects are still meant for development environments, and we may have to wait a bit longer to see successful implementations in production MapReduce and Hadoop environments.



MRQL - a SQL on Hadoop Miracle


Recently, the Apache Incubator accepted a new query engine for Hadoop and Hama, called MRQL (pronounced miracle), which was initially developed in 2011 by Leonidas Fegaras.

MRQL (MapReduce Query Language) is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop and Hama. MRQL has some overlapping functionality with Hive, Impala and Drill, but one major difference is that it can express, in declarative form, many complex data analysis algorithms that cannot be done easily in those systems. So, complex data analysis tasks, such as PageRank, k-means clustering, and matrix multiplication and factorization, can be expressed as short SQL-like queries, while the MRQL system is able to evaluate these queries efficiently.

Another difference from these systems is that the MRQL system can run these queries in BSP (Bulk Synchronous Parallel) mode, in addition to the MapReduce mode. With BSP mode, it achieves lower latency and higher speed. According to MRQL team, “In near future, MRQL will also be able to process very large data effectively fast without memory limitation and significant performance degradation in the BSP mode”.


As a simple example, the MRQL query in Figure 1 implements the k-means clustering algorithm.
Figure 1. K-means Clustering Expressed as an MRQL Query
Figure 2. K-Means Clustering Using MR and BSP Modes for 10 steps.
Figure 2 shows the results of evaluating the K-means query using MR and BSP modes for limit (number of iterations) 10. We can see that the BSP evaluation outperforms the MR evaluation by an order of magnitude.

MRQL team also has plans to support additional distributed processing frameworks, such as Spark and OpenMPI in the future. Currently, a number of researchers and developers from various organizations, such as UT Arlington, Oracle, and Cloudera, are involved in the MRQL project. They are looking forward to your contributions.
You can find more information about MRQL at the website:


About the Author:

A creator of Apache Hama, a committer of Apache BigTop and MRQL. Currently he works at Oracle Corporation. 




Hadoop use cases in the Bitcoin economy


Bitcoin, a decentralized P2P virtual currency, has become a hot topic of discussion and analysis in the last few months. Because it is unregulated and can be exchanged for real-life goods, services or real currency, it is being used for both legitimate and illegal trades. Further, the transaction data is publicly available online, albeit with anonymous ids. In this post, let’s explore possible Bitcoin economy use cases for Hadoop ecosystem products.

Bitcoin was first proposed in 2008 by Satoshi Nakamoto. Participants in this economy acquire a Bitcoin wallet and Bitcoin addresses. Bitcoin addresses send and receive Bitcoins (BTC) in a manner analogous to sending and receiving e-mail from e-mail addresses. The network relies on a shared transaction log called the block chain, in which all confirmed transactions are included without exception. A distributed consensus system called mining confirms waiting transactions by including them in the block chain.



Besides the known, counterintuitive possibility of using Hadoop processing for mining, there are more elaborate use cases in which the products of the Hadoop ecosystem can contribute positively. A few possible ones that can be taken up for a PoC are listed below:


(1)   Determining owner entities:
Although the transaction data is available with anonymous ids only, certain experiments have demonstrated that sender addresses can be clubbed into one entity and the transaction history of that entity traced. An entity here refers to an individual or group using a set of Bitcoin addresses. Further, combined with data crawled from various websites and logs, it is possible to ascertain the real identity behind an anonymous id. For instance, in one analysis that started from a Bitcoin address published by Wikileaks, it was possible to determine the other addresses belonging to the group. A good example of such tracing is available on blockviewer.com, which utilizes Neo4j.
image source: https://github.com/thallium205/BitcoinVisualizer
Products to explore: Apache Hadoop, Apache Drill, Apache Giraph
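
One way to club addresses into entities can be sketched with a union-find structure. The post does not say which heuristic the experiments used; the sketch below assumes the common one that all sender (input) addresses of a single transaction belong to the same entity. The class `AddressClusters` and the sample addresses are hypothetical.

```python
class AddressClusters:
    """Union-find over Bitcoin addresses (illustrative; names are hypothetical)."""

    def __init__(self):
        self.parent = {}   # address -> parent address in its cluster tree

    def find(self, address):
        self.parent.setdefault(address, address)
        root = address
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited address straight at the root.
        while self.parent[address] != root:
            next_address = self.parent[address]
            self.parent[address] = root
            address = next_address
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def add_transaction(self, input_addresses):
        # Assumed heuristic: addresses spent together belong to one entity.
        if not input_addresses:
            return
        first, *rest = input_addresses
        self.find(first)
        for other in rest:
            self.union(first, other)

    def entities(self):
        groups = {}
        for address in list(self.parent):
            groups.setdefault(self.find(address), set()).add(address)
        return list(groups.values())


# Each inner list holds the sender addresses of one made-up transaction.
transactions = [
    ["addr_A", "addr_B"],
    ["addr_B", "addr_C"],
    ["addr_D"],
    ["addr_E", "addr_F"],
]

clusters = AddressClusters()
for inputs in transactions:
    clusters.add_transaction(inputs)
groups = sorted(sorted(g) for g in clusters.entities())
```

On real block chain data the same grouping would be a natural fit for a graph engine such as Giraph, since it amounts to finding connected components.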

(2)   Correlating real-life events with Bitcoin transactions and fluctuations:
Of late, there have been rumors linking unrest in certain countries to Bitcoin transactions. It could be possible to analyze the Bitcoin transaction data, attribute it to certain geographic entities and look for correlation with real-life events.
Products to explore: Apache Hadoop, Apache Lucene, Apache Mahout
           

(3)   Securing your cluster to avoid misuse:
Trojans have been reported to exploit the unused processing power of individual machines and network clusters to mine Bitcoins. It is imperative to protect your cluster against such malware and botnets.
Products to explore: Apache Knox

(4)   Identifying illegal use sites:
There has been reported use of Bitcoins for illegal trades like drugs and narcotics. By detecting anomalous behavior in transactions along with scanning logs, it may be possible to monitor this hot trail.
Products to explore: Apache Hadoop, Apache HBase, Apache Pig, Apache Hive


While the above use cases are applicable to all currency transactions (virtual or fiat), they have been mentioned specifically for Bitcoin due to the open availability of transaction data and the lack of a regulatory mechanism. There is a chance that governments may crack down on this virtual currency in the absence of credible monitoring tools. Herein lies the opportunity, and the threat, for the Bitcoin and Hadoop communities to sustain this currency for legitimate purposes.

Tools and Techniques for Matrix factorization



Matrix factorization, as a technique for collaborative filtering, has been the preferred choice for recommendation systems ever since the Netflix million-dollar competition was held a few years back. Further, with the advent of news personalization, advanced search and user analytics, the concept has gained prominence. In this post, let’s explore the tools and techniques in the Hadoop ecosystem that support matrix factorization.

To understand matrix factorization, we can simplify the computation to finding:
            a p x r matrix B
            an r x q matrix C
such that:
A ~= BC
while minimizing L(A,B,C), computed over N training points,
where A is a p x q input matrix and L is a loss function.

Among the various algorithmic techniques available, the following are more popular:

- Alternating Least Squares (ALS): as popularized in the Netflix contest
Step 1: Initialize matrix C by assigning the average rating for that movie as the first row, and small random numbers for the remaining entries.
Step 2: Fix C, solve B by minimizing the objective function (the sum of squared errors);
Step 3: Fix B, solve C by minimizing the objective function similarly;
Step 4: Repeat Steps 2 and 3 until a stopping criterion is satisfied.
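
The four steps can be sketched in plain Python on a tiny ratings matrix. This is a single-machine illustration, not a distributed implementation. Several details are assumptions that the steps above do not specify: missing ratings are marked with `None`, a small ridge term `lam` is added to the sum of squared errors to keep the linear systems well conditioned, and the stopping criterion is simply a fixed number of sweeps. All function names and the sample matrix are hypothetical.

```python
import random


def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            factor = aug[k][col] / aug[col][col]
            for m in range(col, n + 1):
                aug[k][m] -= factor * aug[col][m]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][m] * x[m] for m in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def least_squares_row(targets, features, r, lam):
    """Minimize sum (t - w.f)^2 + lam * |w|^2 over the given (t, f) pairs."""
    M = [[lam if a == b else 0.0 for b in range(r)] for a in range(r)]
    v = [0.0] * r
    for t, f in zip(targets, features):
        for a in range(r):
            v[a] += t * f[a]
            for b in range(r):
                M[a][b] += f[a] * f[b]
    return solve(M, v)


def als(A, r, lam=0.01, sweeps=20, seed=0):
    rng = random.Random(seed)
    p, q = len(A), len(A[0])
    # Step 1: small random numbers everywhere, then column averages in the first row of C.
    C = [[rng.uniform(0.0, 0.1) for _ in range(q)] for _ in range(r)]
    for j in range(q):
        col = [A[i][j] for i in range(p) if A[i][j] is not None]
        C[0][j] = sum(col) / len(col) if col else 0.0
    B = [[0.0] * r for _ in range(p)]
    for _ in range(sweeps):  # Step 4: fixed number of sweeps as the stopping criterion
        # Step 2: fix C, solve every row of B.
        for i in range(p):
            js = [j for j in range(q) if A[i][j] is not None]
            feats = [[C[k][j] for k in range(r)] for j in js]
            B[i] = least_squares_row([A[i][j] for j in js], feats, r, lam)
        # Step 3: fix B, solve every column of C.
        for j in range(q):
            rows = [i for i in range(p) if A[i][j] is not None]
            col = least_squares_row([A[i][j] for i in rows], [B[i] for i in rows], r, lam)
            for k in range(r):
                C[k][j] = col[k]
    return B, C


def rmse(A, B, C):
    sq = [(A[i][j] - sum(B[i][k] * C[k][j] for k in range(len(C)))) ** 2
          for i in range(len(A)) for j in range(len(A[0])) if A[i][j] is not None]
    return (sum(sq) / len(sq)) ** 0.5


# 4 users x 3 movies; None marks a rating that was never given.
A = [[2, 4, None], [1, 2, 3], [3, None, 9], [None, 8, 12]]
B, C = als(A, r=2)
error = rmse(A, B, C)
```

Each row of B and each column of C is an independent small least-squares problem, which is what makes Steps 2 and 3 easy to spread across mappers or workers.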
           
- Expectation Maximization: used in large scale collaborative filtering and demonstrated experimentation on Google News.
The EM algorithm is an efficient iterative procedure to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data…
Each iteration of the EM algorithm consists of two processes: The E-step, and the M-step. In the expectation, or E-step, the missing data are estimated given the observed data and current estimate of the model parameters. This is achieved using the conditional expectation, explaining the choice of terminology. In the M-step, the likelihood function is maximized under the assumption that the missing data are known. The estimate of the missing data from the E-step are used in lieu of the actual missing data.

- Distributed non negative matrix factorization: a re-scaled version of gradient descent as tested on Internet Explorer ‘Suggested Sites’ function. In the distributed technique leveraging NMF, the data is partitioned along the long dimension for parallelism.
Assuming all matrices can be held in shared memory, multiplicative update rules are applied to determine step sizes for each parameter.

- Stochastic Gradient Descent (SGD): a well-known technique which estimates the direction of steepest descent from a sampled training point and then takes a step in that direction. Its variants include:
  - Partitioned SGD: distribute the data without using stratification and run SGD independently and in parallel on the partitions
  - Pipelined SGD: based on a ‘delayed update’ scheme
  - Decentralized SGD: computation in a decentralized and distributed fashion
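
For reference, the basic sequential form of SGD for A ~= BC can be sketched as follows. It is a single-machine illustration and none of the distributed variants listed above. The squared-error loss with a small L2 penalty `lam`, the learning rate, the initial value range and the sample ratings are all assumptions chosen for the example.

```python
import random


def predict(B, C, i, j):
    return sum(B[i][k] * C[k][j] for k in range(len(C)))


def rmse(samples, B, C):
    total = sum((a - predict(B, C, i, j)) ** 2 for i, j, a in samples)
    return (total / len(samples)) ** 0.5


def sgd_factorize(samples, p, q, r, epochs=500, lr=0.01, lam=0.01, seed=0):
    """Factorize from (row, column, value) training points with plain SGD."""
    rng = random.Random(seed)
    B = [[rng.uniform(0.1, 0.5) for _ in range(r)] for _ in range(p)]
    C = [[rng.uniform(0.1, 0.5) for _ in range(q)] for _ in range(r)]
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)   # visit the training points in random order
        for i, j, a in order:
            err = a - predict(B, C, i, j)
            for k in range(r):
                b, c = B[i][k], C[k][j]
                # Step against the gradient of this one point's regularized squared error.
                B[i][k] += lr * (err * c - lam * b)
                C[k][j] += lr * (err * b - lam * c)
    return B, C


# (user, movie, rating) training points for a 4 x 3 matrix with some entries missing.
samples = [(0, 0, 2), (0, 1, 4), (1, 0, 1), (1, 1, 2), (1, 2, 3),
           (2, 0, 3), (2, 2, 9), (3, 1, 8), (3, 2, 12)]

B0, C0 = sgd_factorize(samples, p=4, q=3, r=2, epochs=0)   # untrained starting point
B, C = sgd_factorize(samples, p=4, q=3, r=2)
before = rmse(samples, B0, C0)
after = rmse(samples, B, C)
```

Because each update touches only one row of B and one column of C, updates on training points that share neither a row nor a column do not conflict, which is the property the partitioned and decentralized variants build on.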


Beyond MapReduce as built in Apache Hadoop, there are a few other tools/frameworks/libraries which are available in Hadoop ecosystem for implementing matrix factorization techniques:

- Apache Mahout

Machine Learning library which contains matrix factorization algorithm for collaborative filtering implemented on top of Apache Hadoop using the map/reduce paradigm.




- Apache Giraph

Open source graph processing framework inspired by Google Pregel which follows the bulk synchronous parallel model and launched typically as Hadoop job.


- Ricardo

A scalable platform for deep analytics. Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R.

Source: Data Intensive Analytics with Hadoop: A Look Inside (refer slide 33 onwards)

There are further advances being made in the various projects in the Apache Hadoop ecosystem. A recent incubator proposal for MRQL project also aims to leverage the BSP model of Apache Hama to compute matrix functions.

Most, if not all, of the techniques listed above can also be run effectively in a distributed environment using only MapReduce and Hadoop. This has been strongly argued in a paper by Jimmy Lin. A ‘MapReduce only’ approach may suffer from a few limitations, like launch overheads, and tends to be categorized as batch only. However, it may still score over a multiple-tool architecture by avoiding integration overheads. With a strong focus on low-latency products and advances being made in analytic accelerators, we may well see more on matrix factorization and collaborative filtering in the coming months.

-----------------------------------------------------------------------

top image source: wikipedia