And you thought your Big Data cluster was faster than your next-door competitor's. Oh well, it's time to prove it, contribute to an initiative, and, of course, get the honors. A new benchmarking initiative called BigData Top100 was announced at the O'Reilly Strata Conference 2013 in a joint presentation by Chaitan Baru (SDSC) and Milind Bhandarkar (Greenplum). Other members of the BigData Top100 List steering group include Dhruba Borthakur (Facebook), Eyal Gutkind (Mellanox), Jian Li (IBM), Raghunath Nambiar (Cisco), Ken Osterberg (Seagate), Scott Pearson (Brocade), Meikel Poess (Oracle), Tilmann Rabl (University of Toronto), Richard Treadway (NetApp), and Jerry Zhao (Google).
In the proposed benchmark, the group aims to arrive at a list of systems, procured on a fixed budget as specified by the benchmark, that can process a representative big data workload on a dataset of fixed size in the least amount of total time.
The seeds of this initiative go back to late 2011, when the Center for Large-scale Data Systems Research (CLDS) at the San Diego Supercomputer Center, University of California San Diego, initiated the effort. As part of it, two key workshops have been organized: in May 2012 at San Jose (USA) and in December 2012 at Pune (India). A third workshop is planned for July 2013 at Xi'an (China).
“These meetings substantiated the initial ideas for a big data benchmark, which would include definitions of the data along with a data-generation procedure; a workload representing common big data applications; and a set of metrics, run rules, and full-disclosure reports for fair comparisons of technologies and platforms. These results would then be presented in the form of the BigData Top100 List, released on a regular basis at a predefined venue such as at the Strata Conferences.”
In a paper published in the Big Data Journal, the group has established a workload specification to be used in the first version of the benchmark. The steps of the proposed end-to-end entity-modeling pipeline, illustrated with a code sketch after the list, include:
1- Collect “user” interactions data and ingest them into the big data platform(s)
2- Reorder the events according to the entity of interest, with secondary ordering according to timestamps
3- Join the “fact tables” with various other “dimension tables”
4- Identify events of interest that one plans to correlate with other events in the same session for each entity
5- Build a model for favorable/unfavorable target events based on the past session information
6- Score the models built in the previous step with the hold-out data
7- Apply the models to the initial entities which did not result in the target event
8- Publish and apply the model
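As a rough illustration only, here is a minimal sketch of those eight steps in Python, using pandas and scikit-learn as small-scale stand-ins for whatever big data platform an actual submission would use. The file names, column names, target event ("purchase"), and choice of model are all assumptions made for the example and are not part of the benchmark specification.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Step 1: collect user interaction events and ingest them (hypothetical CSV stand-in).
events = pd.read_csv("interactions.csv")   # assumed columns: user_id, item_id, event_type, ts

# Step 2: reorder events by the entity of interest, with secondary ordering by timestamp.
events = events.sort_values(["user_id", "ts"])

# Step 3: join the fact table with a dimension table (hypothetical item catalog).
items = pd.read_csv("items.csv")           # assumed columns: item_id, category
events = events.merge(items, on="item_id", how="left")

# Step 4: per entity, derive session features and flag the target event
# (here, arbitrarily, whether the entity ever produced a "purchase" event).
per_user = events.groupby("user_id").agg(
    n_events=("event_type", "size"),
    n_categories=("category", "nunique"),
    target=("event_type", lambda s: int((s == "purchase").any())),
)

# Steps 5 and 6: build a model on the session features and score it on hold-out data.
X, y = per_user[["n_events", "n_categories"]], per_user["target"]
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("hold-out AUC:", roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1]))

# Steps 7 and 8: publish the model and apply it to entities that did not produce
# the target event, ranking them by predicted propensity.
candidates = per_user[per_user["target"] == 0].copy()
candidates["score"] = model.predict_proba(candidates[["n_events", "n_categories"]])[:, 1]
print(candidates.sort_values("score", ascending=False).head())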
From a critique perspective, the first evident pointer is to similar initiatives in high-performance computing. Questions have also been raised about the methodology for arriving at a fair comparison, and the steering group is heavily weighted toward corporate organizations. But within this critique probably lie the strengths of this pioneering initiative. First, if a robust list of the Top 100, and in future the Top 500, Big Data systems can come out, it would not only stimulate competition but also set a standard to live by for enterprise and non-enterprise clusters alike. Similarly, the steering group has made it clear that it would follow a “concurrent benchmarking model, where one version of the benchmark is implemented while the next revision is concurrently being developed, incorporating more features and feedback from the first round of benchmarking”. Further, the heavyweight corporate representation in the steering group as well as in the workshops indicates that both the interest and the competition are high.
With regard to rollout, watch out for three Kaggle contests coming up this year, for data collection, reference implementations, and proposals respectively. Wishing the organizers and contestants a good one, and hoping to see the benchmark evolve further as a self-organizing initiative.