Apples and oranges: a comparison of RDF benchmarks and real RDF datasets

Authors:
Songyun Duan;Anastasios Kementsietsidis;Kavitha Srinivas;Octavian Udrea
Affiliations:
IBM Research - Thomas J. Watson Research Center, Hawthorne, NY, USA (all authors)
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

The widespread adoption of the Resource Description Framework (RDF) for representing both open web and enterprise data is the driving force behind the growing research interest in RDF data management. As RDF data management systems proliferate, so do benchmarks that test the scalability and performance of these systems under data and workloads with various characteristics. In this paper, we compare data generated with existing RDF benchmarks against data found in widely used real RDF datasets. The results of our comparison show that existing benchmark data have little in common with real data. Therefore, conclusions drawn from existing benchmark tests might not translate to the behaviours expected in real settings. In terms of the comparison itself, we show that simple primitive data metrics are inadequate to expose the fundamental differences between real and benchmark data. We make two contributions in this paper: (1) to address the limitations of the primitive metrics, we introduce intuitive and novel metrics that can highlight the key differences between distinct datasets; (2) to address the limitations of existing benchmarks, we introduce a new benchmark generator with the following novel characteristics: (a) the generator can take any (real or synthetic) dataset and convert it into a benchmark dataset; (b) the generator can generate data that mimic the characteristics of real datasets with user-specified data properties. On the technical side, we formulate the benchmark generation problem as an integer programming problem whose solution provides the desired benchmark datasets. To our knowledge, this is the first methodological study of RDF benchmarks, as well as the first attempt at generating RDF benchmarks in a principled way.
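
As a rough illustration of what "simple primitive data metrics" over an RDF dataset might look like, the following minimal sketch counts basic quantities over a list of (subject, predicate, object) triples. The abstract does not define the metrics used in the paper, so the function name `primitive_metrics`, the choice of counts, and the toy `ex:` data are all assumptions made for illustration only.

```python
def primitive_metrics(triples):
    """Compute basic counts over (subject, predicate, object) triples.

    Hypothetical helper: the choice of metrics is an assumption,
    not the set of metrics defined in the paper.
    """
    triples = set(triples)  # an RDF graph is a set, so drop duplicate triples
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}
    n = len(triples)
    return {
        "triples": n,
        "subjects": len(subjects),
        "predicates": len(predicates),
        "objects": len(objects),
        # average number of triples per distinct subject / object
        "avg_out_degree": n / len(subjects) if subjects else 0.0,
        "avg_in_degree": n / len(objects) if objects else 0.0,
    }


# Toy dataset with made-up identifiers.
toy = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:name", '"Alice"'),
    ("ex:bob", "ex:name", '"Bob"'),
    ("ex:bob", "ex:knows", "ex:alice"),
]
metrics = primitive_metrics(toy)
print(metrics)
```

Counts of this kind summarize a dataset's size and average connectivity only, which is consistent with the abstract's point that such metrics alone may fail to separate real data from benchmark data.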