The academic community and industry are currently researching and building next-generation data management systems. These systems are designed to analyze high-volume data sets with high ingest rates and short response times, while executing complex data analysis algorithms on data that does not adhere to relational data models. Because these big data systems differ from standard relational database systems in both data and workloads, the traditional benchmarks used by the database community are insufficient. In this paper, we describe initial solutions and challenges with respect to big data generation, methods for creating realistic, privacy-aware, and arbitrarily scalable data sets, workloads, and benchmarks from real-world data. In particular, we discuss why we believe that the workloads currently discussed in the testing and benchmarking community do not capture the real complexity of big data, and we highlight several research challenges with respect to massively parallel data generation and data characterization.