
Hadoop High 5 with Cloudera’s Mike Olson

We asked Mike Olson, Chairman and Chief Strategy Officer of Cloudera, five questions about Hadoop. Mike co-founded Cloudera in the summer of 2008. He served as its CEO until mid-2013, when he moved to his current Chairman and CSO role. An engineer by training, he has held a variety of technical and business roles during his career.

1. Tell us about your journey with Cloudera so far.

When Amr (Awadallah), Jeff (Hammerbacher), Christophe (Bisciglia) and I started the company in the summer of 2008, no one outside of the consumer internet space had ever heard of Hadoop. Big data wasn't yet a meme. We believed at the time that the infrastructure that Google had invented, and that Doug Cutting and Mike Cafarella had turned into an open source project, was so transformative and powerful that it could change the way that enterprises thought about data. We believed that data, at scale, in enormous variety, posed a huge challenge that the market hadn't yet recognized. 

And we thought we could do something about that. 

Fast forward about five and a half years, and you see that the original vision has come true. The incumbent vendors back then dismissed the technology and us; today, virtually all of them have incorporated Hadoop into their strategy, and into their product lines. The enterprise customers that we believed needed the new platform now use it in critical production workloads, for important business applications, across many vertical markets and around the world. 

Over the years, the character of our conversation with the market has changed. We were, early on, evangelists in the desert, preaching the future of big data and explaining what this new Hadoop thing was, and how it worked. A year or two later, as that lesson sunk in, we began to talk about the features and services that enterprises needed - security, authentication, reliability, high availability. Twelve or eighteen months later, the story we told was applications and tools -- how to get at your data, what to do with it.

These days, the conversation is first, and sometimes only, about solving real business problems. How can you turn data you have into answers you need to build better products, improve business operations, keep customers happy? 

Those transitions were driven by the changing market, but also by what we learned as we worked with real customers on real problems. The changing conversation built on the changing substrate - as Hadoop the software evolved, became more consumable, got more reliable, we could shift the message from the intricacy of the plumbing to business value.

That's the market and the evolution of our engagement with it. We've co-evolved, really; we were first in the space, and while ours was not the only voice, I do believe that our investment in the open source community and in our customers created much of the buzz - bordering on hype - we see today. Certainly the number of companies in the market, and the ubiquity of Hadoop-based product offerings from companies large and small, validates our vision and builds on Cloudera's success. 

The growth of Cloudera itself - well, that's been flat-out remarkable. Our very first office space was a borrowed conference room on Third Street in San Mateo, in the AdMob offices, in a converted shopping center. Five of us were huddled around a shared table. We are, today, about two orders of magnitude larger, with staff and offices around the world. Our customer base, our revenues, the amount of data managed by our platform, our partner network, our pipeline -- every number we track has kept pace with that growth. You dream about success when you're just getting started. You can't really imagine what it will take to get there - I certainly didn't. It takes my breath away a little bit to look back over those five years, and to look forward to the next five, and ten, and more.

2. How do you see Hadoop-related technology and the market evolving in the next three years?

Expansion, expansion, expansion. The platform's getting vastly more capable. It's enterprise-ready today, and it's adding highly valuable new capabilities steadily. It's taking on more, and more demanding, workloads all the time. It'll be a core piece of the infrastructure that traditional enterprises rely on. We'll see high-end, mission-critical apps built to run natively on it. We'll see spending on the platform grow into the capital-B-billions.

3. What opportunities and challenges have you encountered so far in Cloudera's large partner ecosystem?

Hadoop by itself isn't valuable. It's only useful, and it only gets adopted, when it solves real, valuable business problems for our customers. As with relational databases, nobody buys our platform for its own sake. They have a problem, and they need an application, and the application drags the platform along. 

Two years ago we began to see first-class tools emerge in substantial numbers, running on Hadoop. Business people could use those to explore their data, investigate relationships, formulate and test out ideas about their business, their customers. They could uncover facts, and deduce truths, that they could never do before. They used more data in new ways, and learned new things. 


One year ago, we began to see native, purpose-built big data applications built on Hadoop. Why is a mobile carrier beset by customer churn? There is, it turns out, an app for that. What's your next best action in engagement with one of your retail customers? Likewise, and so on. These aren't exploratory data discovery tools; they're full-blown applications designed to address particular, important business questions. 

Tools and apps drive adoption. When we started five years ago, there were zero. That was a challenge. There was, and I believe still is, plenty of opportunity in this space; lots of innovation is happening, because there's lots of money to be made. I think, frankly, the big returns - to companies who build these tools and apps, and to the entrepreneurs who build those companies - are concentrated there, in the next year or two. We're a platform vendor, and we'll do well when that whole ecosystem does well. But that means we're playing a much longer game. The fast money is in stuff that end users touch!

4. What initiatives has Cloudera been taking to make Hadoop more enterprise-friendly?

Every single vendor in the space shipping a Hadoop distro is shipping a bunch of Cloudera code. We have contributed to the whole ecosystem of projects - 75 committers, numerous newly-created projects, baked into the open source ecosystem and the products that all of our competitors ship.

We were first to create and ship a distro. We were the first to address new use cases by adding a new engine, HBase, to our offering. We were first to introduce an open source, interactive SQL engine to the platform, first to do the same for search. We've innovated consistently in open source to address emerging enterprise use cases.

We've invested a lot in security. We drove authentication and authorization via Kerberos across the full suite of projects. We built encryption of data on disk, and in flight among nodes in a cluster, in direct response to enterprise requirements. Some months ago we built and released as open source the Sentry project, providing user- and role-based grant/revoke support for SQL running in the platform.
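Sentry's model, as described above, is expressed as SQL GRANT/REVOKE statements issued through Hive or Impala. A minimal sketch of what that role-based setup looks like, with the role, group, and table names here purely hypothetical:

```python
# Sketch of Sentry-style role-based access control, expressed as the
# SQL statements an administrator would issue through Hive or Impala.
# The role, group, and table names below are hypothetical examples.

def sentry_grant_statements(role, group, table, privilege="SELECT"):
    """Build the statements that grant a group access to a table via a role."""
    return [
        f"CREATE ROLE {role}",
        f"GRANT {privilege} ON TABLE {table} TO ROLE {role}",
        f"GRANT ROLE {role} TO GROUP {group}",
    ]

def sentry_revoke_statements(role, table, privilege="SELECT"):
    """Build the statements that withdraw that access again."""
    return [
        f"REVOKE {privilege} ON TABLE {table} FROM ROLE {role}",
        f"DROP ROLE {role}",
    ]

for stmt in sentry_grant_statements("analysts", "bi_team", "sales.orders"):
    print(stmt)
```

The key design point is the indirection: privileges attach to roles, and roles attach to groups, so access for a whole team can be granted or revoked in one place rather than per user.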

It's years back, now, but we invested a lot of time and money in getting the high-availability Namenode built, and improving performance in HDFS until it was - and remains - better than proprietary variants currently on the market. Of course performance, like security, is a journey: we continue to invest, and to enhance the platform, with every release.

Configuration, monitoring, management, administration? Access control, audit logging, compliance reporting, data lineage? Backup? Disaster recovery? Customers needed them, and we built them.
It's a really long list. Fundamentally, our job is to make customers successful. That means delivering the most capable product on the market, with all the ways that an enterprise needs to get at its data, to satisfy its business problems. It also means delivering a grown-up, enterprise-hardened system that can be deployed with confidence, where our customers can live up to the service level agreements they have with their customers.

And we do this continually. We make a major new feature release every twelve to eighteen months, and ship quarterly updates on a committed schedule in between. Bug fixes and patches come out by way of our proactive support organization on demand. We have continual conversations with our customers about their use cases, their deployments, and other installations, so that they can solve problems before they happen.

5. On a more individual note, can you share with us what inspires you each day?

We build software that big enterprises - banks, hospitals, insurance companies, utilities, you name it - are using to do their work better. We're a successful business because of that. That's fun.

We also see non-traditional use, in sectors that are genuinely inspiring. Organizations that we work with are exploring the onset and progress of disease, so that they can intervene earlier and more effectively, delivering better patient outcomes more affordably. Others are improving the production and distribution of food, helping to feed the burgeoning world population better and more reliably. They're producing more energy more cleanly, and managing its consumption more carefully, to mitigate atmospheric CO2 levels and to begin to control global warming.

That inspiration is episodic, though. I show up at my desk in Palo Alto or my other desk in San Francisco, or by way of United Airlines at our office in Research Triangle Park or Nashua or Tyson's Corner or Manhattan or London or Tokyo or or or...

I show up at work, and I get to spend my day with a really tremendous team. I'm honest-to-God lucky to work at Cloudera. We're changing the world, one petabyte at a time. Not many people get a chance to do that.

