Robust runtime optimization and skew-resistant execution of analytical SPARQL queries on Pig

Authors:
Spyros Kotoulas; Jacopo Urbani; Peter Boncz; Peter Mika
Affiliations:
IBM Research, Ireland, Vrije Universiteit Amsterdam, The Netherlands; IBM Research, Ireland; CWI Amsterdam, The Netherlands, Vrije Universiteit Amsterdam, The Netherlands; Yahoo! Research Barcelona, Spain
Venue:
ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I
Year:
2012

Abstract

We describe a system that incrementally translates SPARQL queries to Pig Latin and executes them on a Hadoop cluster. The system is designed to work efficiently on complex queries with many self-joins over huge datasets, avoiding job failures even when joins exhibit unexpected high-value skew. To be robust against cost estimation errors, the system interleaves query optimization with query execution, determining the next steps to take based on data samples and statistics gathered during the previous step. Furthermore, we have developed a novel skew-resistant join algorithm that replicates tuples corresponding to popular keys. We evaluate the effectiveness of our approach both on a synthetic benchmark known to generate complex queries (BSBM-BI) and on a Yahoo! data-analysis use case over RDF data crawled from the Web. Our results indicate that the system is capable of processing huge datasets without pre-computed statistics while exhibiting good load-balancing properties.
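
The idea of replicating tuples for popular keys can be illustrated with a minimal in-memory sketch. This is not the algorithm from the paper, whose details are not given in the abstract; it is a generic replicate-and-scatter join written under stated assumptions. The function name `skew_resistant_join`, the parameters `num_partitions` and `hot_threshold`, and the use of exact key counts instead of samples are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def skew_resistant_join(left, right, num_partitions=4, hot_threshold=3):
    """Join two lists of (key, value) pairs across simulated partitions.

    Assumption: a key is "popular" if it occurs at least `hot_threshold`
    times in `left`. A real system would estimate this from a sample.
    """
    freq = Counter(key for key, _ in left)
    hot = {key for key, count in freq.items() if count >= hot_threshold}

    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]

    # Left side: scatter popular keys round-robin, hash-partition the rest.
    next_slot = 0
    for key, value in left:
        if key in hot:
            left_parts[next_slot % num_partitions].append((key, value))
            next_slot += 1
        else:
            left_parts[hash(key) % num_partitions].append((key, value))

    # Right side: replicate popular keys to every partition so that each
    # scattered left tuple still meets all of its join partners.
    for key, value in right:
        if key in hot:
            for part in right_parts:
                part.append((key, value))
        else:
            right_parts[hash(key) % num_partitions].append((key, value))

    # Each partition runs an ordinary local hash join.
    results, loads = [], []
    for lpart, rpart in zip(left_parts, right_parts):
        index = defaultdict(list)
        for key, value in rpart:
            index[key].append(value)
        loads.append(len(lpart))
        for key, lvalue in lpart:
            for rvalue in index.get(key, []):
                results.append((key, lvalue, rvalue))
    return results, loads


# Usage: key "a" is heavily skewed on the left side.
left = [("a", i) for i in range(8)] + [("b", 100), ("c", 200)]
right = [("a", "x"), ("b", "y"), ("d", "z")]
results, loads = skew_resistant_join(left, right)
```

In this sketch the eight tuples for key `a` are spread evenly over the four partitions instead of landing on one, at the cost of copying the matching right-side tuple to every partition. The join result is the same as that of a plain hash-partitioned join.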