BigBench: towards an industry standard benchmark for big data analytics

Authors:
Ahmad Ghazal;Tilmann Rabl;Minqing Hu;Francois Raab;Meikel Poess;Alain Crolotte;Hans-Arno Jacobsen
Affiliations:
Teradata Corp., El Segundo, CA, USA;University of Toronto, Toronto, ON, Canada;Teradata Corp., El Segundo, CA, USA;InfoSizing, Inc., Manitou Springs, CO, USA;Oracle Corp., Redwood Shores, CA, USA;Teradata Corp., El Segundo, CA, USA;University of Toronto, Toronto, ON, Canada
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Citing 19
Cited 3

The 007 Benchmark

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
The BUCKY object-relational benchmark

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Programming pearls (2nd ed.)

Programming pearls (2nd ed.)
The making of TPC-DS

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
XMark: a benchmark for XML data management

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Why you should run TPC-DS: a workload analysis

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions

Proceedings of the VLDB Endowment
Principles for an ETL Benchmark

Performance Evaluation and Benchmarking
FlumeJava: easy, efficient data-parallel pipelines

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Benchmarking cloud serving systems with YCSB

Proceedings of the 1st ACM symposium on Cloud computing
A data generator for cloud-scale benchmarking

TPCTC'10 Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems
EXRT: towards a simple benchmark for XML readiness testing

TPCTC'10 Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems
YCSB++: benchmarking and performance debugging advanced features in scalable table stores

Proceedings of the 2nd ACM Symposium on Cloud Computing
Efficient update data generation for DBMS benchmarks

ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Solving big data challenges for enterprise application performance management

Proceedings of the VLDB Endowment

DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems

Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
Can we analyze big data inside a DBMS?

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
A multidimensional data model for TPC-DS benchmarking

Proceedings of the 5th Asia-Pacific Symposium on Internetware

Quantified Score

Hi-index	0.00

Visualization

Abstract

There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare the performance of these systems. In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and synthetic data generator that addresses the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark, which is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business prospective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical prospective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques. We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 Gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.