Hadoop High 5 with IBM's Anjul Bhambhri

In the thought leadership series 'Hadoop High 5', we ask leaders in the big data and Hadoop ecosystem about their vision and the path ahead. Continuing this series, we asked Anjul Bhambhri five questions. Anjul Bhambhri is the Vice President of Big Data at IBM. She was previously the Director of IBM Optim application and data life cycle management tools. She is a seasoned professional with over twenty-five years in the database industry. Over this time, Anjul has held various engineering and management positions at IBM, Informix and Sybase. Prior to her assignment in tools, Anjul spearheaded the development of XML capabilities in IBM's DB2 database server. She is a recipient of the YWCA of Silicon Valley's “Tribute to Women in Technology” award for 2009. Anjul holds a degree in Electrical Engineering.

1. Tell us about your journey with big data and Hadoop so far.

IBM has invested heavily in Hadoop to date, and we’ll continue to do so. We see it as a key technology that solves a variety of problems for our customers. InfoSphere BigInsights is IBM's distribution of Hadoop. It incorporates 100% standard open-source components, but we’ve included enterprise-grade features, which is very important to our client base. We’ve done this in a way that preserves the openness of the platform, but also provides significant advantages, helping customers deploy solutions faster and more cost-efficiently. Each one of our enterprise-grade features is opt-in for customers. They can remain purely on the open source capabilities if they choose, and we provide support per IBM standard support models. I would like to point out that this support model provides greater longevity than is available from other pure-play open source vendors.

We have also been active contributors to specific projects and areas that are of value to our enterprise customer base, such as HBase, metadata management, security and encryption, in addition to a number of bug fixes in various projects of the ecosystem. We have also brought the power of Hadoop to the Power and Z Mainframe platforms.

A good example of our commitment is Big SQL -- IBM’s ANSI-compliant SQL on Hadoop. Big SQL leverages more than 30 years of experience in database engineering. Big SQL works natively on Hadoop data sources, and interoperates seamlessly with open source tools. This is a big win for customers since they can use their existing investments in SQL tools, including Cognos, SPSS and other third-party tools. They can gain seamless access to data, regardless of where it’s stored. They don't need to rewrite their queries into someone's less capable dialect of SQL. This not only saves time and money -- it simplifies the environment and helps customers deploy solutions faster and more reliably.

IBM believes that Hadoop has become the heterogeneous compute platform that allows us to run a variety of applications, especially analytics. While a number of initial implementations focused on basic warehouse-style processing, data lakes, and optimization of transformation workloads, they are now graduating to higher-level analytics using polystructured data. New geospatial data is being brought in, for example, to analyze accident proclivity in specific zip codes based on road conditions. Traffic information is being integrated with vehicle wear and tear information to influence insurance rating policies. Such analytics require a large amount of both compute and storage, and Hadoop has made it possible.

2. A common question: does IBM Watson complement or substitute for the Hadoop ecosystem?

IBM Watson specializes in cognitive computing and Q&A-style analytics and solutions. Hadoop is an essential underpinning for such a system, especially in an enterprise context. Data in enterprises is imperfect, and needs a set of curation and aggregation steps to ensure that the Watson magic can be applied. Information extraction and entity integration are two key elements of this curation process. IBM’s distribution of Hadoop, BigInsights, provides comprehensive text and machine learning capabilities that Watson uses in this curation process. If I'm writing an application to parse human-to-human call-center conversations, for example, or an application to process social media feeds or log files, I can build and deploy these applications much faster and get better results, because IBM has already done the heavy lifting in its text and machine learning engines. This means customers can start solving real business problems faster, and glean insights more easily, from their unstructured data.
We see the actual capture and storage of data in Hadoop as the easy part of the problem. Anybody can do this. We're focused on the analytic tooling that can help customers get value out of the information they're capturing.

3. Among the various big data use cases that you have implemented with customers, which one really made you say ‘wow’?

Organizations today are overwhelmed with Big Data. In fact, the world is generating more than 2.5 billion gigabytes of data every day, and 80 percent of it is “unstructured”: everything from images, video and audio to social media and a blizzard of impulses from embedded sensors and distributed devices. Applying analytics to this data can help businesses handle the onslaught of information and make better business decisions and, most importantly, can even save lives.
UCLA, for instance, is relying on breakthrough big data technology to help patients with traumatic brain injuries. At UNC Healthcare, doctors are using a big data and Smarter Care solution to identify high-risk patients, understand in context what’s causing them to be hospitalized, and then take preventative action.
Scientists and engineers at IBM continue to push the boundaries of science and technology with the goal of creating systems that sense, learn, reason and interact with people in new ways. It’s aimed at making the relationship between human and computer more natural so that increasingly, computers will offer advice, rather than waiting for commands.

4. Looking over the horizon, where do you see the Hadoop market heading in the next three years?

The success of open source Hadoop means organizations today can rethink their approach to how they handle information. Rather than taking the time to decide what information to retain and what to discard, they can keep all the data in Hadoop in case it’s needed, which is much more cost efficient. With access to more data, management can get a better understanding of customers, their needs and how they use products and services.
This is pretty impressive, considering few people even knew about Hadoop five years ago.

We believe that the focus for Hadoop will shift from data ingest, processing and related plumbing to interactive information discovery and collaborative analytics over the large amounts of data that Hadoop encourages organizations to store. We see Spark as a fundamental enabler of a new generation of analytics applications, especially because of its unified programming and application development model, as well as its unique approach to in-memory processing.
Hadoop will continue to handle the heavy batch reporting and analytics workloads, and will become the next-generation warehousing platform. However, we believe that the combination of Spark and Hadoop, executing on the same substrate of storage and compute, will solve the one fundamental problem of conventional warehousing: making it accessible, actionable, and useful for business users by spawning an entirely new set of analytics applications.
For this to happen, there needs to be a strong tool chain that turns every business analyst into a data scientist over time. It is also imperative that we have reference architectures and frameworks that allow for standardization between applications, so that they can exchange data and collaborate with one another.

5. Given some superpowers for a day, what would you like to do with them?

I have always believed in using technology to bring fundamental change to people’s lives. Big data has the promise to bring about such change.
As a woman and a mother, I hold dear to my heart the education of children all the way through college. With my superpowers, I would create a set of applications and advanced analytics built on Hadoop and Spark that would help teachers understand students' needs. The applications would also help teachers identify indicators that a student may drop out of university, and provide better counseling on the courses and majors that fit the strengths of each individual. The applications would use Watson to help students attend the schools that best fit them, with the right level of financial assistance, so that they can enter the workforce with little debt, if any at all.
If I can get all of that done in a day, I am sure the next day will have a new dawn!


  1. IBM and Spark - "We see Spark as a fundamental enabler of a new generation of analytics applications,".

    very interesting!!

  2. Thumbs up to 'women in data' at leadership position.

