Optimizing RDF(S) queries on cloud platforms

Authors:
HyeongSik Kim; Padmashree Ravindra; Kemafor Anyanwu
Affiliations:
North Carolina State University, Raleigh, NC, USA (all authors)
Venue:
Proceedings of the 22nd International Conference on World Wide Web Companion
Year:
2013

Abstract

Scalable processing of Semantic Web queries has become a critical need given the rapid growth in the availability of Semantic Web data. The MapReduce paradigm is emerging as a platform of choice for large-scale data processing and analytics due to its ease of use, cost effectiveness, and potential for unlimited scaling. Processing queries over Semantic Web triple models is challenging on Apache Hadoop, the mainstream MapReduce platform, and on its extensions such as Pig and Hive. This is because such queries require numerous joins, which lead to lengthy and expensive MapReduce workflows. Further, in this paradigm, cloud resources are acquired on demand, and traditional join optimization machinery such as statistics and indexes is often absent or not easily supported. In this demonstration, we will present RAPID+, an extended Apache Pig system that uses an algebraic approach for optimizing queries on RDF data models, including queries involving inferencing. The basic idea is that by using logical and physical operators that are more natural to MapReduce processing, we can reinterpret such queries in a way that leads to more concise execution workflows and smaller intermediate data footprints, which minimize disk I/O and network transfer overhead. RAPID+ evaluates queries using the Nested TripleGroup Data Model and Algebra (NTGA). The demo will show the comparative performance of NTGA query plans vs. the relational algebra-like query plans used by Apache Pig and Hive.
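
To make the contrast between join-heavy plans and grouping-based plans concrete, the following is a minimal sketch in plain Python, not RAPID+ or NTGA code. It assumes that a "triplegroup" is the set of triples sharing a subject, which the abstract does not spell out, and it uses made-up toy data. The function names `star_by_joins` and `star_by_grouping` are hypothetical. The sketch evaluates one star-shaped pattern two ways: with one join per additional triple pattern, and with a single group-by-subject pass followed by a filter on whole groups.

```python
from collections import defaultdict

# Toy RDF triples (subject, property, object); all values are illustrative only.
TRIPLES = [
    ("p1", "type", "Product"),
    ("p1", "label", "Widget"),
    ("p1", "price", "10"),
    ("p2", "type", "Product"),
    ("p2", "label", "Gadget"),
    ("p3", "label", "Orphan"),
    ("p3", "price", "7"),
]

# Star pattern: ?s type ?t . ?s label ?l . ?s price ?p
STAR = ("type", "label", "price")


def star_by_joins(triples, props):
    """Relational-style plan: one join on subject per additional triple pattern."""
    # Bindings for the first triple pattern.
    rows = [{"s": s, props[0]: o} for s, p, o in triples if p == props[0]]
    joins = 0
    for prop in props[1:]:
        matches = [(s, o) for s, p, o in triples if p == prop]
        # Nested-loop join on the shared subject variable.
        rows = [dict(r, **{prop: o}) for r in rows for s, o in matches if s == r["s"]]
        joins += 1
    return rows, joins


def star_by_grouping(triples, props):
    """Grouping-style plan: group once by subject, then keep complete groups."""
    groups = defaultdict(list)
    for s, p, o in triples:
        if p in props:
            groups[s].append((p, o))
    # A group matches only if it contains every property in the star pattern.
    return {
        s: sorted(pos)
        for s, pos in groups.items()
        if set(props) <= {p for p, _ in pos}
    }


if __name__ == "__main__":
    rows, joins = star_by_joins(TRIPLES, STAR)
    print("join plan:", rows, "joins used:", joins)
    print("grouping plan:", star_by_grouping(TRIPLES, STAR))
```

Both plans find the same subject, but the join plan needs one join step per extra triple pattern, while the grouping plan touches the data once. On a MapReduce platform each join step could become its own job with its own intermediate output, which is the kind of workflow length and data footprint the abstract says NTGA aims to reduce.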