Exploring varied data stores with Apache MetaModel

With relational, NoSQL and unstructured data stores all in wide use, it is natural to look for a library that can act as a common exploration and querying connector. Apache MetaModel is one such library: it aims to provide a common interface for discovery, metadata exploration and querying across different types of data sources. With MetaModel, a user can currently query CouchDB, MongoDB, HBase, Cassandra, MySQL, Oracle DB and SQL Server, among others.

"MetaModel is a library that encapsulates the differences and enhances the capabilities of different data stores. Rich querying abilities are offered to data stores that do not otherwise support advanced querying and a unified view of the data store structure is offered through a single model of the schema, tables, columns and relationships."


HadoopSphere discussed the functional fit and features of Apache MetaModel with Kasper Sorensen, VP of Apache MetaModel. Here is what Kasper had to say about this interesting product.

Please help us understand the purpose of Apache MetaModel. Which use cases does it fit in?


MetaModel was designed to make connectivity across very different types of databases, data file formats and the like possible with just one uniform approach. We wanted an API where you can explore and query a data source using exactly the same codebase regardless of whether the source is an Excel spreadsheet, a relational database or something completely different.

There are quite a lot of frameworks out there that do this. But they more or less have the requirement that you need to map your source to some domain model and then you end up actually querying the domain model, not the data source itself. So if your user is the guy that is interested in the Excel sheet or the database table itself then he cannot directly relate his data with what he is getting from the framework. This is why we used the name ‘MetaModel’ – we present data based on a metadata model of the source, not based on a domain model mapping approach.

How does Apache MetaModel work?


It is important for us that MetaModel has very few infrastructure requirements - so you don’t have to use any particular container type or dependency injection framework. MetaModel is plain-Java oriented: if you want to use it, you just instantiate objects, call methods and so on.

Whenever you want to interact with data in MetaModel you need an object that implements the DataContext interface. A DataContext is a bit like a “Connection” to a database. The DataContext exposes methods to explore the metadata (schemas, tables etc.) as well as to query the actual data. The Query API is preferably type-safe Java, linked to the metadata objects, but you can also express your query as SQL if you like.
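
To make this concrete, here is a minimal sketch (not taken from the interview) of exploring and querying a CSV file through a DataContext. The file, table and column names are made up, and it assumes the MetaModel core and CSV modules are on the classpath.

```java
import java.io.File;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Table;

public class ExploreAndQuery {
    public static void main(String[] args) {
        // Any DataContext is used the same way; here it happens to wrap a CSV file.
        DataContext dc = DataContextFactory.createCsvDataContext(new File("persons.csv"));

        // Metadata exploration: the CSV file surfaces as a table named after the file.
        Table table = dc.getDefaultSchema().getTableByName("persons.csv");
        for (Column column : table.getColumns()) {
            System.out.println(column.getName() + " : " + column.getType());
        }

        // Type-safe query, linked to the metadata objects above.
        DataSet ds = dc.query()
                .from(table)
                .select(table.getColumnByName("name"))
                .where(table.getColumnByName("country")).eq("Denmark")
                .execute();
        try {
            while (ds.next()) {
                System.out.println(ds.getRow());
            }
        } finally {
            ds.close();
        }
    }
}
```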

Depending on the type of DataContext, we have a few different ways of working. For SQL databases we of course want to delegate more or less the whole query to the database. On other source types we can only delegate some parts of the query to the underlying database. For instance, if you apply a GROUP BY operation to a NoSQL database, then we usually have to do the grouping part ourselves. For that we have a pluggable query engine. Finally some source types, such as CSV files or XML documents, do not have a query engine already and we wrap our own query engine around them.
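
As a hedged illustration of that last point, the sketch below runs a GROUP BY over a CSV file; since a flat file has no query engine of its own, MetaModel's built-in engine evaluates the grouping and the count. The file and column names are assumptions.

```java
import java.io.File;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.query.Query;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Table;

public class GroupByOnCsv {
    public static void main(String[] args) {
        DataContext dc = DataContextFactory.createCsvDataContext(new File("orders.csv"));
        Table orders = dc.getDefaultSchema().getTableByName("orders.csv");
        Column country = orders.getColumnByName("country");

        // COUNT(*) per country; the grouping is performed by MetaModel itself,
        // not by the underlying source.
        Query q = new Query().from(orders).select(country).selectCount().groupBy(country);
        DataSet ds = dc.executeQuery(q);
        try {
            while (ds.next()) {
                System.out.println(ds.getRow());
            }
        } finally {
            ds.close();
        }
    }
}
```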

Some sources can also be updated. We offer a functional interface where you pass an UpdateScript object that performs the required update, when the underlying source allows it, and with whatever transactional features that source may or may not support.
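
A minimal sketch of that update path, again with made-up file, table and column names (the CSV data store is one of the updateable ones):

```java
import java.io.File;

import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.UpdateCallback;
import org.apache.metamodel.UpdateScript;
import org.apache.metamodel.UpdateableDataContext;

public class InsertRow {
    public static void main(String[] args) {
        // The cast is only safe for data stores that support updates.
        UpdateableDataContext dc = (UpdateableDataContext)
                DataContextFactory.createCsvDataContext(new File("persons.csv"));

        dc.executeUpdate(new UpdateScript() {
            @Override
            public void run(UpdateCallback callback) {
                // Runs with whatever transactional guarantees the store offers.
                callback.insertInto("persons.csv")
                        .value("name", "Jane Doe")
                        .value("country", "Denmark")
                        .execute();
            }
        });
    }
}
```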

Give us a snapshot of what Apache MetaModel is competing with - both in the open source and commercial ecosystems.


I don’t think there’s anything out there that is really a LOT like MetaModel. But there are obviously some typical frameworks that have a hint of the same.

JPA, Hibernate and so on are similar in that they essentially abstract away the underlying storage technology. But they are very different in the sense that they are modelled around a domain model, not the data source itself.

LINQ (for .NET) has a lot of similarities with MetaModel. Obviously the platform is different though and the syntax of LINQ is superior to anything we can achieve as being “just” a library. On the plus-side for MetaModel, I believe we have one of the easiest interfaces to implement if you want to make your own adaptor.

What lies ahead on the roadmap for Apache MetaModel in 2015?


We are in a period of taking small steps so that we get a feel of what the community wants. For example we just made a release where we added write/update support for our ElasticSearch module.

So the long-term roadmap is not really set in stone. But we do always want to expand the portfolio of supported data stores. I personally also would like to see MetaModel used in a few other Apache projects so maybe we need to work outside of our own community, engaging with others as well.

Why would a user explore metadata using Apache MetaModel and not connect to various data stores directly?


If you only need to connect to one data store, which already has a query engine and all - then you don’t have to use MetaModel. A key strength of MetaModel is the uniform access to multiple data stores, and similarly a weakness is in utilizing all the functionality of a single source. We do have great metrics overall, but if you’re looking to optimize the use of just one source then you can typically get better results by going directly to it.

Another reason might be to get a query API for things such as CSV files, Spreadsheets and so on, which normally have no query capabilities. MetaModel will provide you with a convenient shortcut there.

Testability is also a strong point of MetaModel. You may write code to interact with your database, but simply test it using a POJO data store (an in-memory Java collection structure), which is fast and lightweight.
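
A hedged sketch of that testing approach, assuming the metamodel-pojo module is available; the schema, table and column names are made up:

```java
import java.util.ArrayList;
import java.util.Collection;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.pojo.ArrayTableDataProvider;
import org.apache.metamodel.pojo.PojoDataContext;
import org.apache.metamodel.util.SimpleTableDef;

public class PojoTestExample {
    public static void main(String[] args) {
        // A tiny in-memory "persons" table backed by plain Java arrays.
        Collection<Object[]> rows = new ArrayList<Object[]>();
        rows.add(new Object[] { "Jane Doe", "Denmark" });
        rows.add(new Object[] { "John Doe", "Sweden" });
        SimpleTableDef tableDef =
                new SimpleTableDef("persons", new String[] { "name", "country" });
        DataContext dc =
                new PojoDataContext("test_schema", new ArrayTableDataProvider(tableDef, rows));

        // Code written against the DataContext interface runs unchanged against this.
        DataSet ds = dc.query().from("persons").select("name")
                .where("country").eq("Denmark").execute();
        try {
            while (ds.next()) {
                System.out.println(ds.getRow());
            }
        } finally {
            ds.close();
        }
    }
}
```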

It seems Apache MetaModel does not support HDFS though it supports HBase. Any specific reason for that?


Not really, except that it hasn’t been requested by the community yet. But we do have interfaces that quite easily let you use, for example, the CsvDataContext on a resource (file) in HDFS. In fact, a colleague of mine already did that for a separate project where a MetaModel-based application was applied to Hadoop.
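
One possible way to do something similar (not necessarily what that project did) is to feed a CSV file stored in HDFS to MetaModel as a plain InputStream. This sketch assumes a CsvDataContext constructor that accepts an InputStream, a Hadoop client on the classpath, and made-up namenode and path names:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvConfiguration;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Table;

public class CsvOnHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        try (InputStream in = fs.open(new Path("/data/persons.csv"))) {
            // The stream is read once; fine for exploration and one-off queries.
            DataContext dc = new CsvDataContext(in, new CsvConfiguration());
            for (Table table : dc.getDefaultSchema().getTables()) {
                for (Column column : table.getColumns()) {
                    System.out.println(table.getName() + "." + column.getName());
                }
            }
        }
    }
}
```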

My guess as to why this is less interesting is that if you’re an HDFS user then you typically have such an amount of data that you don’t want a querying framework (like MetaModel) anyway, but rather a processing framework (such as MapReduce) to handle it.

Does Apache MetaModel support polyglot operations with multiple data stores?


Yes, a common thing that our users ask is stuff like “I have a CSV file which contains keys that are also represented in my database – can I join them?” … And with MetaModel you absolutely can. We offer a class called CompositeDataContext which basically lets you query (and explore for that matter) multiple data stores as if they were one.
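
A hedged sketch of that kind of cross-store join; the JDBC URL, file, table and column names are all made up:

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.metamodel.CompositeDataContext;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Table;

public class CompositeJoinSketch {
    public static void main(String[] args) throws Exception {
        // A CSV file and a relational database, each wrapped as a DataContext.
        DataContext csv = DataContextFactory.createCsvDataContext(new File("customers.csv"));
        Connection connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "secret");
        DataContext jdbc = DataContextFactory.createJdbcDataContext(connection);

        // One logical DataContext spanning both stores.
        DataContext composite = new CompositeDataContext(csv, jdbc);
        Table customers = composite.getTableByQualifiedLabel("customers.csv");
        Table orders = composite.getTableByQualifiedLabel("orders");

        // Join CSV keys against the database table as if they lived in one store.
        DataSet ds = composite.query()
                .from(customers)
                .innerJoin(orders).on(customers.getColumnByName("id"),
                        orders.getColumnByName("customer_id"))
                .select(customers.getColumnByName("name"),
                        orders.getColumnByName("amount"))
                .execute();
        try {
            while (ds.next()) {
                System.out.println(ds.getRow());
            }
        } finally {
            ds.close();
        }
    }
}
```

Since no single underlying store can see both tables, the join itself is evaluated by MetaModel rather than pushed down to either source.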



Kasper Sorensen is the VP of Apache MetaModel and in his daily life he works as Principal Tech Lead at Human Inference, a Neopost company. The main products in his portfolio are Apache MetaModel and the open source Data Quality toolkit DataCleaner.

Mars orbiter to turn into a big data appliance

Space researchers had for years been trying to figure out what to do with ageing satellites and unmanned orbiters after mission completion. In a big breakthrough, researchers have now found a method to make these ageing satellites and orbiters run as extra-terrestrial Hadoop cluster nodes.

With the raw processing and storage resources available in these machines, they can well be used to serve as “commodity space-ware”. In a published paper, the architecture of such a space cluster is described as follows.
- One of the powerful satellites would be working as Active NameNode.
- A similar configuration satellite would act as Passive NameNode to provide space class High Availability.
- The rest of the available satellites and orbiters would be assigned as DataNodes.
- In an obvious extension of Zookeeper, there would be a UFOkeeper to act as central service coordinator.

The benefits are obvious:
- Utilize the available hardware in space for distributed processing without needing to launch new spacecraft with extra processing power.
- Reduce space waste by reusing the satellites rather than letting them lie dormant or die on their own.
- Utilize solar power to run big data clusters rather than rely on expensive power in earth-based data centers.

It has been learnt from credible sources that Mangal(g)yan, a Mars orbiter currently in space, could be the first one to work on this architecture. Once the current mission of Mangal(g)yan is finished, it is possible that the orbiter could be turned into the first space-based big data appliance. Processing and storage on this appliance would then be available via Mangal(g)yan’s space cloud service (SCS). An incubation proposal has been submitted to the Apache Software Foundation under the name Apache Space.

Many organizations have already expressed interest in leveraging this first-of-its-kind space cloud, and with plans to use OpenStack and open source Apache Hadoop, developers can look forward to contributing to this novel project. For the developers who may have already figured it out, this post is just a figment of space fiction and an All Fools’ Day post in the spirit of the first day of April. So, just smile, grin and get back to your Eclipse on Windows.
(No intentional or unintentional offence intended to any agency.)




Governance in a data lake

A robust data governance layer is becoming an essential requirement for an enterprise data lake. Continuing our discussion on data governance, we focus on Apache Falcon as a solution option for governing data pipelines. HadoopSphere discussed with Srikanth Sundarrajan, VP of Apache Falcon, the product as well as data governance requirements. In the first part of the interview, we talked about Falcon's architecture. We discuss the functional aspects further in the interaction below.

What lies ahead on the roadmap of Apache Falcon for 2015?

Major focus areas for Apache Falcon in 2015 and beyond:
- Entity management and instance administration dashboard: Currently, CLI-based administration is very limiting, and the real power of the dependency information available within Falcon can’t be unlocked without an appropriate visual interface. Entity management complexities can also be cut down through a friendlier UI.
- Recipes: Today Falcon supports the notion of a process to perform some action over data. But there are standard and routine operations that may be applicable for a wide range of users. The Falcon project is currently working on enabling this through the notion of a recipe. This will let users convert their standard data routines into templates for reuse and, more importantly, allow common templates to be shared across users and organizations.
- Life cycle: Falcon supports standard data management functions off the shelf; however, these do not cater to every user’s requirements and might require customization. The Falcon team is currently working on opening this up and allowing it to be customized per deployment to cater to the specific needs of a user.
- Operational simplification: When Falcon becomes the de-facto platform (as is the case with some of the users), the richness of the dependency information it contains can be leveraged to operationally simplify how data processing is managed. Today, handling infrastructure outages, maintenance, degradation or application failures can stall large pipelines, causing cascading issues. Dependency information in Falcon can be used to seamlessly recover from these without any manual intervention.
- Pipeline designer: This is a forward-looking capability in Falcon that enables big data ETL pipelines to be authored visually. It would generate code in a language such as Apache Pig and wrap it in an appropriate Falcon process, defining the appropriate feeds.

Can you elaborate on key desired components of big data governance regardless of tool capabilities at this stage?

Security, quality, provenance and privacy are fundamental when it comes to data governance:
- Quality: The quality of data is one of the most critical components, and there have to be convenient ways both to audit the system for data quality and to build proactive mechanisms that cut out any sources of inaccuracy.
- Provenance: Organizations typically have complex data flows, and it is often challenging to figure out the lineage of this data. Being able to get this lineage at the dataset level, field level and record level (in that order of importance) is very important.
- Security: This is fundamental and a matter of hygiene for any data system. Authentication, authorization and audit trails are non-negotiable. Every user has to be authenticated, and all access to data has to be authorized and audited.
- Privacy: Data anonymization is one of the key techniques for conforming to the laws and regulations of the land. This is something that data systems have to natively support or enable.

Why would an enterprise use open source Apache Falcon rather than commercial tools (like Informatica)?

Apache Falcon is a Hadoop-first data management system and integrates well with widely adopted standard components in the big data open source ecosystem. This native integration with Hadoop is what makes it a tool of choice. Apache Falcon being available under the liberal APL 2.0 license and housed under the ASF allows users and organizations to experiment with it easily and also enables them to contribute their extensions. The recent elevation of Apache Falcon to a top-level project also assures users of the community-driven development process adopted within the Falcon project.

If someone is using the Cloudera distribution, what are the options for them?

Apache Falcon is distribution agnostic and should work (with some minor tweaks) for anyone using Apache Hadoop 2.5.0 and above along with Oozie 4.1.0.  There are plenty of users who use Apache Falcon along with HDP. One of the largest users of Apache Falcon has used it along with CDH 3 and CDH 4, and there are some users who have tried using Apache Falcon with MapR distribution as well.


Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next-generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella, including Apache Lens, Apache Hadoop core and Apache Oozie. He has been working with distributed processing systems for over a decade, and with Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from the University of Southern California.



Data pipelines with Apache Falcon

Over the past few months, Apache Falcon has been gaining traction as a data governance engine for defining, scheduling, and monitoring data management policies. Apache Falcon is a feed processing and feed management system aimed at making it easier for Hadoop administrators to define their data pipelines and auto-generate workflows in Apache Oozie.

HadoopSphere interacted with Srikanth Sundarrajan, VP of Apache Falcon, to understand the usage and functional intricacies of the product. This is what Srikanth told us:

What is the objective of Apache Falcon and what use cases does it fit in?


Apache Falcon is a tool focused on simplifying data and pipeline management for large-scale data, particularly data stored and processed through Apache Hadoop. The Falcon system provides standard data life cycle management functions such as data replication, eviction and archival, while also providing strong orchestration capabilities for pipelines. Falcon maintains dependency information between data elements and processing elements. This dependency information can be used to provide data provenance besides simplifying pipeline management.

Today the Apache Falcon system is widely used to address the following use cases:
- Declaring and managing data retention policies
- Data mirroring, replication and DR backup
- Data pipeline orchestration
- Tracking data lineage and provenance
- ETL for Hadoop

The original version of Apache Falcon was built back in early 2012. InMobi Technologies, one of the largest users of the system, has been using it as a de-facto platform for managing their ETL pipelines for reporting and analytics, model training, enforcing data retention policies, data archival and large-scale data movement across the WAN. Another interesting application is running identical pipelines at multiple sites on local data, with the results merged in a single location.

How does Apache Falcon work? Can you describe the major components and their functions?


Apache Falcon has three basic entities that it operates with:
- firstly, cluster, which represents physical infrastructure,
- secondly, feed, which represents a logical data element with a periodicity, and
- thirdly, process, which represents a logical processing element that may depend on one or more data elements and may produce one or more data elements.

At its core, Apache Falcon maintains two graphs: (1) the entity graph and (2) the instance graph. The entity graph captures the dependencies between the entities and is maintained in an in-memory structure. The instance graph, on the other hand, maintains information about instances that have been processed and their dependencies. It is stored in a Blueprints-compatible graph database.
 
[Figure: Falcon Embedded Mode]

The Falcon system integrates with Apache Oozie for all its orchestration requirements, be it a data life-cycle function or process execution. While control over all the workflows and their execution remains with Oozie, Falcon injects pre- and post-processing hooks into the flows, allowing it to learn about the execution and completion status of each workflow executed by Oozie. The post-processing hook essentially sends a notification via JMS to the Falcon server, which then uses this control signal to support other features such as re-runs and CDC (change data capture).

Falcon has two modes of deployment: embedded and distributed. In embedded mode, the Falcon server is complete by itself and provides all the capabilities. In distributed mode, the Prism, a lightweight proxy, takes center stage and provides a veneer over multiple Falcon instances, which may run in different geographies. The Prism ensures that the Falcon entities are in sync across the Falcon servers and provides a global view of everything happening on the system.

[Figure: Falcon Distributed Mode]

The Falcon system has a REST-based interface and a CLI over it. It integrates with Kerberos to provide authentication and minimal authorization capabilities.



Srikanth Sundarrajan works at Inmobi Technology Services, helping architect and build their next-generation data management system. He is one of the key contributors to Apache Falcon and currently VP of the project. He has been involved in various projects under the Apache Hadoop umbrella, including Apache Lens, Apache Hadoop core and Apache Oozie. He has been working with distributed processing systems for over a decade, and with Hadoop in particular over the last 7 years. Srikanth holds a graduate degree in Computer Engineering from the University of Southern California.



Top Big Data influencers of 2014

Big Data is an exciting technology space innovating at a pace probably never seen before. With a dynamic ecosystem and scorching pace of product development, it is easy to be left behind. However, thanks to visionaries in this ecosystem who have been able to decode the maze and set things right for us, we have been seeing successful big data use cases and implementations. 

HadoopSphere presents below its annual list of top big data influencers. This list reflects the people, products, organizations and portals that exercised the most influence on big data and its ecosystem in the year. The influencers have been listed in the following categories:


  • Analysts
  • Online Media
  • Products
  • Social Media
  • Angels
  • Thought Leaders
The infographic below shows the Top Big Data influencers of 2014 as ranked by HadoopSphere.



Analysts:

Mike Gualtieri: In 2013, Mike Gualtieri of Forrester had predicted that big data would be the Time Person of the Year. Well, it almost came true, with a big data use (or misuse) case (Edward Snowden) making it to runner-up for Time Person of the Year. Besides occasionally playing sorcerer, Mike has remained one of the most well-respected analysts in 2014, commanding a comprehensive vision of and view on the data ecosystem.

Curt Monash: If you have not been reading Curt Monash, you have probably been living on an island. And if you got a few incisive comments on your product, well, then you are probably part of the urban elite in this big data city. Don’t expect courtesies; just expect plain, honest assessment, and that too with technical depth, from Curt.

Tony Baer: Consistency and clarity are the forte of Tony Baer, Ovum’s principal analyst. Expect consistently sane advice with clear-cut guidance on what to expect and what not to expect from Tony. He has remained a top influencer in the big data and Hadoop ecosystem for another year.

Online Media:

TDWI: With research papers, blogs, webinars and education events, TDWI continued to attract eyeballs and sponsors alike, making it one of the top focused industry portals.

IBM: IBM is a technology company but runs a media machine of its own. Its data initiatives, such as IBM Data Magazine, IBM Big Data Hub, Big Data University, developerWorks and Redbooks, together continued to be among the top traffic getters. Though the content may be partly IBM-specific, overall it did a great job of educating the big data community.

DZone: With ‘smart content’ for big data professionals, DZone continued to encourage the community to contribute links, articles, guides and ‘refcardz’. DZone ensured both quality and good traffic volume, resulting in a high influence on techies.

Products:

Spark: Do we need to say anything about this obvious choice? Apache Spark has been the flavor of all seasons since the beginning of 2014. With the biggest open source community in the big data ecosystem, it continued to define and influence the shape of future products.

Scala: Although technically a programming language and not a product, Scala is listed here as it marches ahead to become a preferred language for big data programming. With both Apache Spark and Flink promoting it in a big way, the simplicity and power of the language became more obvious. We expect Scala to become one of the most powerful languages in a few years.

Kafka: If you need an example of word-of-mouth success, here it is. Apache Kafka was developed at LinkedIn and was not part of the major Hadoop distributions until early 2015. Still, it has emerged as a preferred choice for data ingestion and has seen adoption by internet companies, financial majors and travel portals, among others.

Social Media:

Ben Lorica: If one has a dream Twitter handle like ‘bigdata’, it may not be sheer coincidence; it probably shows the handle’s owner was talking about big data before the rest of us had heard of it. Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. and commands the ‘bigdata’ Twitter handle with its impressive following.

Gregory Piatetsky-Shapiro: As the President of KDnuggets, which provides analytics and data mining consulting, Gregory is a founder of KDD (the Knowledge Discovery and Data Mining conferences). He is one of the leading social influencers, with his mentions generating huge follower interest.

Kirk Borne: Dr. Kirk Borne is Professor of Astrophysics and Computational Science at George Mason University. As a data scientist and astrophysicist, he mostly talks about big data on social media and continues to attract a huge follower base.

Angel Investors:

Naval Ravikant: An entrepreneur and angel investor, Naval Ravikant is a co-founder of AngelList. Through this terrific forum and other offline activities, he has been championing the cause of many startups and taking them through the funding gates.

Data Collective: DCVC (aka Data Collective) is a seed and early-stage venture capital fund that invests in big data companies. Its extended team consists of more than 35 “Equity Partners”, who are notable technical founders and executives, data scientists and engineers. Notable portfolio companies include Blue Data, Continuity, Elasticsearch and Citus Data.

Thought Leaders:

Mike Olson: As the Chief Strategy Officer of Cloudera, Mike Olson is a leader whose vision has been driving his company and much of the Hadoop ecosystem. His unbridled passion, combined with an ability to foresee market dynamics, makes him one of the biggest thought leaders and influencers in the entire information technology arena. From marketing Hadoop to touting Impala as an MPP engine or mentoring the competing Spark, Mike has exhibited unparalleled transformational leadership.

Merv Adrian: Merv Adrian is Research VP at Gartner and the better-known face of the research company in social media and event circles. Each year Gartner somehow manages to strike a rough note with the big data vendors, be it the “trough of disillusionment” or the “data lake fallacy” comment. However, Merv, with his astute knowledge of the Hadoop ecosystem, the BI world and technology lifecycles, has helped people understand the discordant notes and apply caution, restraint and intelligence beyond the obvious hype. And that’s what thought leaders do: create sense and a path out of chaos and conflict. Pro tip: Merv may not agree with you, but he will still have the two of you understand a common path.





Hadoop High 5 with IBM's Anjul Bhambhri

In the thought leadership series called 'Hadoop High 5', we ask leaders in the big data and Hadoop ecosystem about their vision and the path ahead. Continuing this series, we asked Anjul Bhambhri five questions. Anjul Bhambhri is the Vice President of Big Data at IBM. She was previously the Director of IBM Optim application and data life cycle management tools. She is a seasoned professional with over twenty-five years in the database industry. Over this time, Anjul has held various engineering and management positions at IBM, Informix and Sybase. Prior to her assignment in tools, Anjul spearheaded the development of XML capabilities in IBM's DB2 database server. She is a recipient of the YWCA of Silicon Valley's “Tribute to Women in Technology” award for 2009. Anjul holds a degree in Electrical Engineering.

1. Tell us about your journey with big data and Hadoop so far.


IBM has invested heavily in Hadoop to date, and we’ll continue to do so. We see it as a key technology that solves a variety of problems for our customers. InfoSphere BigInsights is IBM's distribution of Hadoop. It incorporates 100% standard open-source components, but we’ve included enterprise-grade features, which is very important to our client base. We’ve done this in a way that preserves the openness of the platform but also provides significant advantages, helping customers deploy solutions faster and more cost-efficiently. Each one of our enterprise-grade features is an opt-in for customers. They can choose to remain purely on the open source capabilities if they so choose, and we provide support per IBM standard support models. I would like to point out that this support model provides longer longevity than is available from other pure-play open source vendors.

We have also been active contributors to specific projects that are of value to our enterprise customer base, such as HBase, metadata management, security and encryption, in addition to a number of bug fixes in various projects of the ecosystem. We have also brought the power of Hadoop to the Power and Z mainframe platforms.

A good example of our commitment is Big SQL, IBM’s ANSI-compliant SQL on Hadoop. Big SQL leverages more than 30 years of experience in database engineering. Big SQL works natively on Hadoop data sources and interoperates seamlessly with open source tools. This is a big win for customers, since they can use their existing investments in SQL tools, including Cognos, SPSS and other third-party tools. They gain seamless access to data, regardless of where it’s stored. They don't need to rewrite their queries into someone's less capable dialect of SQL. This not only saves time and money; it simplifies the environment and helps customers deploy solutions faster and more reliably.

IBM believes that Hadoop has managed to become the heterogeneous compute platform that allows us to run a variety of applications, especially analytics. While a number of initial implementations focused on basic warehouse-style processing, data lakes and optimization of transformation workloads, they are now graduating to higher-level analytics using poly-structured data. New geospatial data is being brought in, for example, to analyze accident proclivity in specific zip codes based on road conditions. Traffic information is being integrated with vehicle wear-and-tear information to influence insurance rating policies. Such types of analytics require a large amount of both compute and storage, and Hadoop has made this possible.

2. One of the common questions is: does IBM Watson complement or substitute for the Hadoop ecosystem?


IBM Watson specializes in cognitive computing and Q&A-style analytics and solutions. Hadoop is an essential underpinning for such a system, especially in an enterprise context. Data in enterprises is imperfect and needs a set of curation and aggregation steps to ensure that the Watson magic can be applied. Information extraction and entity integration are two key elements that go into this curation process. IBM’s distribution of Hadoop, BigInsights, provides comprehensive text and machine learning capabilities that Watson uses in this curation process. If I'm writing an application to parse human-to-human call-center conversations, for example, or an application to process social media feeds or log files, I can build and deploy these applications much faster and get better results, because IBM has already done the heavy lifting in its text and machine learning engines. This means customers can start solving real business problems faster, and glean insights more easily, from their unstructured data.
We see the actual capture and storage of data in Hadoop as the easy part of the problem. Anybody can do this. We're focused on the analytic tooling that can help customers get value out of the information they're capturing.

3. Among the various big data use cases that you have implemented with customers, which one really made you say ‘wow’?


Organizations today are overwhelmed with Big Data. In fact, the world is generating more than 2.5 billion gigabytes of data every day, and 80 percent of it is “unstructured”: everything from images, video and audio to social media and a blizzard of impulses from embedded sensors and distributed devices. Applying analytics to data can help businesses handle the onslaught of information and make better business decisions, but most importantly, it can even save lives.
UCLA, for instance, is relying on breakthrough big data technology to help patients with traumatic brain injuries. At UNC Healthcare, doctors are using a big data and Smarter Care solution to identify high-risk patients, understand in context what’s causing them to be hospitalized, and then take preventative action.
Scientists and engineers at IBM continue to push the boundaries of science and technology with the goal of creating systems that sense, learn, reason and interact with people in new ways. It’s aimed at making the relationship between human and computer more natural so that increasingly, computers will offer advice, rather than waiting for commands.

4. Looking over the horizon, where do you see the Hadoop market heading in the next three years?


The success of open source Hadoop means organizations today can rethink their approach to how they handle information. Rather than taking the time to decide what information to retain and what to discard, they can keep all the data in Hadoop in case it’s needed, which is much more cost-efficient. With access to more data, management can get a better understanding of customers, their needs and how they use products and services.
This is pretty impressive, considering few people even knew about Hadoop five years ago.

We believe that the focus for Hadoop will shift from data ingest, processing and related plumbing to interactive information discovery and collaborative analytics over the large amounts of data it encourages organizations to store. We see Spark as a fundamental enabler of a new generation of analytics applications, especially because of its unified programming and application development model, as well as its unique approach to in-memory processing.
Hadoop will continue to handle the heavy batch reporting and analytics workloads as well as become the next-generation warehousing platform. However, we believe that the combination of Spark and Hadoop, executing on the same substrate of storage and compute, will solve the one fundamental problem of conventional warehousing: making it accessible, actionable and useful for business users by spawning an entirely new set of analytics applications.
For this to happen, there needs to be a strong tool chain that turns every business analyst into a data scientist over time. It is also imperative that we have reference architectures and frameworks that allow for standardization between applications, so that they can exchange data and collaborate with one another.

5. Given some superpowers for a day, what would you like to do with them?


I have always believed in using technology to bring fundamental change to people’s lives. Big data holds the promise of bringing about such change.
As a woman and a mother, I hold the education of children, all the way through college, dear to my heart. With my superpowers, I would create a set of applications and advanced analytics, built on Hadoop and Spark, that would help teachers understand student needs. The applications would also help identify drop-out indicators at universities and provide better counseling on which courses and majors fit the strengths of each individual. They would use Watson to help students attend the schools that best fit them, with the right level of financial assistance, so that they can enter the workforce with little debt, if any.
If I can get all of that done in a day, I am sure the next day will bring a new dawn!

