
Hadoop High 5 with MapR's John Schroeder

1. Give us a glimpse of your journey with MapR so far.

MapR Technologies has experienced major customer acquisition and corporate expansion since I co-founded the company in 2009. MapR is currently in hyper-growth mode, continuing to deliver Big Data platform technology and expanding global service and support. After a two-year, well-funded engineering effort, MapR launched as a company in 2011, coming out of stealth mode with a major strategic OEM agreement and releasing the industry's only differentiated Hadoop platform, one that addressed the reliability, availability, ease-of-use and performance issues that must be solved to support critical applications and enable broad adoption.

At MapR, we’re believers in both the internal and external cloud. In 2012, the MapR Distribution became available on Google Compute Engine and within the Amazon Elastic MapReduce service, the two 800-pound gorillas of cloud services. Our technology leadership and reputation for pioneering product innovation are now enabling thousands of customers to better manage and analyze Big Data.

Today, the MapR Big Data Platform is used in production deployments across financial services, government, healthcare, manufacturing, retail and Web 2.0 companies to drive significant business results. These deployments include the analysis of hundreds of billions of objects a day, data on 90% of the Internet population monthly, and over a trillion dollars of retail transactions annually.

MapR’s latest technology accomplishments include the availability of the MapR M7 Distribution, which provides groundbreaking capabilities for Apache HBase™ NoSQL applications to enhance Big Data operations. We also recently announced the distribution of LucidWorks Search with the MapR Platform, and we have worked as part of the open source community to help incubate and spearhead the development of Apache Drill for low-latency, large-scale interactive SQL. On a single platform, customers can now perform predictive analytics, conduct full search and discovery, and run advanced database operations. MapR is pushing the envelope on performance and manageability at a time when Hadoop is crossing over from early adopters to full production mode in the enterprise.

2. Although the Hadoop ecosystem has been pretty dynamic, what do you see as the major trends for the next 3 years?

During the next three years, Hadoop will continue to expand its capabilities and will be used in a growing number of applications, further establishing its market leadership in Big Data analytics. Some important trends will include:

One platform for the broadest range of use cases across organizations.

Our more advanced customers are blazing the trail and have clearly communicated that they want MapR to continue to push the limits in providing one Big Data platform they can provision for the broadest range of use cases across their organizations. Big Data platforms represent big investments. The return on that investment is highest when scalable decision-support, operational, batch, interactive, real-time, production, and ad hoc functionality are all provided by the platform. Supporting these use cases requires multiple industry-standard APIs, including Hadoop, SQL, NoSQL, file-based, machine learning, real-time and search, to run seamlessly against the platform without moving data between systems. Multi-tenancy requires governance of compute, network and storage resources to ensure service levels, and it requires securing data to preclude inadvertent leaks or purposeful attacks.

Security concerns finally addressed.

Security continues to be a barrier to adoption of external and internal cloud architectures like MapR’s and those provided by MapR partners Amazon and Google. In our personal lives, privacy and security are eroding as millions chronicle their lives in social media. Few people are concerned that their email and smartphone communications are being used to learn more about them, hopefully only for benign targeted-marketing initiatives and not more serious intrusions. The corporate world is moving in the opposite direction, with decreasing tolerance for information leaks that result in identity theft, the release of health records or confidential information, regulatory violations, lawsuits and, in the case of the federal government, threats to homeland security. During 2013 and going forward, MapR, and the industry as a whole, will make significant progress in securing Big Data and providing authorization, access control and encryption.

Cloud-based architectures eclipse traditional enterprise architectures.

The traditional enterprise architecture required customers to pay a premium for software and hardware resources in exchange for the promise of five nines of availability: essentially, that they wouldn’t fail. An example of this promise is paying between $5,000 and $15,000 per terabyte for enterprise-class SAN and NAS storage on the strength of that dependability. Yet these devices have spinning disks and heat-generating CPUs, and they will fail over time. Cloud architectures, pioneered by Web 2.0 companies and the basis for MapR, target a different design center: they assume drives, servers, switches and software will fail, and they are architected to absorb those failures transparently, without any service disruption. Cloud architectures like MapR’s use redundancy and instant, stateful failover for all hardware and software resources. Typical cloud-service hardware provides complete compute and storage for less than $400 per terabyte. Cloud architectures like MapR’s reliably and securely knit together tens, hundreds, and thousands of these servers, resulting in an increasingly abundant storage and compute resource that has created the “Big Data” market disruption.
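The per-terabyte figures above can be compared even after accounting for the redundancy that absorbs hardware failures. A minimal sketch, using the prices quoted in the text and assuming the common Hadoop default of 3x replication (the SAN/NAS figures are taken as already including their built-in redundancy):

```python
# Illustrative cost comparison using the per-terabyte figures above.
# The 3x replication factor is the common Hadoop default; treating the
# SAN/NAS price as already redundant is an assumption for this sketch.

def effective_cost_per_tb(raw_cost_per_tb, replication_factor=1):
    """Cost per terabyte of usable capacity after replication."""
    return raw_cost_per_tb * replication_factor

san_low = effective_cost_per_tb(5_000)       # enterprise SAN/NAS, low end
san_high = effective_cost_per_tb(15_000)     # enterprise SAN/NAS, high end
cloud = effective_cost_per_tb(400, replication_factor=3)  # commodity servers

print(f"SAN/NAS: ${san_low:,}-${san_high:,} per usable TB")
print(f"Commodity cloud (3x replication): ${cloud:,} per usable TB")
print(f"Savings vs low-end SAN: {1 - cloud / san_low:.0%}")
```

Even with every byte stored three times, the commodity approach comes in at roughly a quarter of the low-end enterprise price in this back-of-the-envelope comparison.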

HBase will become a popular platform for Blob Stores.

HBase provides a non-relational database atop the Hadoop Distributed File System (HDFS). In certain distributions, HBase applications gain several advantages, including a unified platform for tables and files, no need for manual region splits or merges, centralized configuration and logging, and consistent throughput with low latency for database operations. Some distributions also add support for high availability, data protection with mirroring and snapshots, automatic data compression, and rolling upgrades.
One application that is particularly well suited to HBase is blob stores, which require large databases with rapid retrieval. Blobs are binary large objects (typically images, audio clips or other multimedia objects), and storing blobs in a database enables a variety of innovative applications. One example is a digital wallet, where users upload their credit card images, checks and receipts for online processing, easing banking, purchasing and lost-wallet recovery.
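The digital-wallet idea above hinges on row-key design: HBase retrieves a blob in a single keyed lookup, so the key must both identify the object and spread writes evenly across regions. A real deployment would use the HBase client API against a cluster; the sketch below stands in a plain dict for the table, and the key layout and salt-bucket count are illustrative assumptions, not a product schema.

```python
import hashlib

# Minimal sketch of HBase-style row-key design for a blob store such as
# a digital wallet. A dict stands in for the HBase table; the salted
# "bucket|user|blob" key layout is an illustrative assumption.

SALT_BUCKETS = 16  # spread sequential writes across regions

def row_key(user_id: str, blob_id: str) -> bytes:
    # Salting the key avoids hot-spotting one region when many blobs
    # arrive for lexicographically adjacent users; the salt is derived
    # from the user id so reads can recompute it.
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    return f"{salt:02d}|{user_id}|{blob_id}".encode()

table = {}  # stand-in for an HBase table

def put_blob(user_id: str, blob_id: str, blob: bytes) -> None:
    table[row_key(user_id, blob_id)] = blob

def get_blob(user_id: str, blob_id: str) -> bytes:
    return table[row_key(user_id, blob_id)]

put_blob("alice", "receipt-0001", b"\x89PNG...image bytes...")
print(len(get_blob("alice", "receipt-0001")))
```

Because the salt is a deterministic function of the user id, a lost-wallet recovery can scan one bucket's key range for everything belonging to that user without touching the other fifteen.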

NoSQL expands Hadoop from batch analytics to operational use cases.

With its roots in search engines, Hadoop was purpose-built for cost-effective analysis of datasets as enormous as the World Wide Web: millions of pages of content are analyzed in batches and then served up in real time during searches. The advances mentioned above and other improvements in Hadoop’s capabilities now make it possible to stream data into a cluster and analyze it interactively, both in real time. Use cases like telecommunications billing, and in some cases logistics applications, have outgrown traditional relational database architectures. Thirty years ago, telcos tracked a few hundred calls per week to a single home phone; today they track data and voice transactions across a multitude of devices in one household. Hadoop and HBase provide scale and efficiency advantages over traditional relational data models for these kinds of applications.
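The billing workload described above maps naturally onto Hadoop's map/shuffle/reduce shape: each call detail record (CDR) is mapped to an (account, minutes) pair, pairs are grouped by account, and a reducer sums them. A toy sketch of that flow, with an illustrative record layout rather than any real telco schema:

```python
from collections import defaultdict

# Toy MapReduce-style aggregation of call detail records (CDRs),
# sketching the billing workload described above. Field names are
# illustrative assumptions, not a real telco schema.

cdrs = [
    {"account": "A1", "device": "phone",  "minutes": 12},
    {"account": "A1", "device": "tablet", "minutes": 3},
    {"account": "B7", "device": "phone",  "minutes": 45},
]

def map_phase(record):
    # Emit (key, value) pairs, as a Hadoop mapper would.
    yield (record["account"], record["minutes"])

def reduce_phase(key, values):
    # Sum all values seen for one key.
    return (key, sum(values))

# Shuffle: group mapper output by key before reducing.
grouped = defaultdict(list)
for record in cdrs:
    for key, value in map_phase(record):
        grouped[key].append(value)

totals = dict(reduce_phase(k, v) for k, v in grouped.items())
print(totals)  # → {'A1': 15, 'B7': 45}
```

On a cluster the same three phases run in parallel across billions of records per day, which is what lets the per-household, per-device tracking described above stay affordable.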

3. Among the various implementations for Hadoop, which use cases are you most excited about?

Partnering with customers to build applications they couldn’t build before, while also cutting their costs, is exciting. Take cable providers as an example: they can offload their data warehouse processing at a fraction of the cost. We recently implemented a data warehouse offload resulting in over $25M in customer savings over the next two years. New applications built on the new architecture give the provider the ability to use recommendation engines for content.

As a cable customer, I’m ecstatic that by using MapR they’ll be able to provide more relevant on-demand content. Soon, when searching for on-demand content, we’ll see relevant choices based on our household viewing history, rather than the non-personalized choices all subscribers are presented with today.

4. If there is one tip that you would like to give to Hadoop enthusiasts, what would that be?

The growing number of organizations using Hadoop have found it to be an indispensable analytical tool, capable of unlocking value previously hidden deep in data to improve decision-making and gain a competitive edge. Its many advantages have given rise to an entire ecosystem, as well as cloud-based offerings from Google, Amazon, and other service providers. But organizations have also discovered serious limitations: Hadoop can require considerable expertise to integrate, use and manage, and it often takes considerable effort to protect the data and keep the cluster of servers operational.

I recommend that companies look at all the critical dimensions that ensure Hadoop is deployable in a wide variety of enterprise environments. This means Hadoop must be easy to integrate into the enterprise, as well as more enterprise grade in its operation, performance, scalability and reliability. Specifically, Hadoop platforms should provide enterprise-grade data protection, full high availability, and the ability to integrate into existing environments.

5. If you had a time machine, what would you do with it?

I’d love to go back in time and prevent historical atrocities, but I’ve been taught by too many sci-fi movies that the best intentions can lead to disastrous results. Meeting our past religious and political leaders would be insightful and inspiring. That said, I’d probably go back to the final two minutes of Super Bowl XLVII and ask Jim Harbaugh to let Frank Gore and Kap use their amazing skills.

