An intermediate algebra for optimizing RDF graph pattern matching on MapReduce

Authors:
Padmashree Ravindra;HyeongSik Kim;Kemafor Anyanwu
Affiliations:
Department of Computer Science, North Carolina State University, Raleigh, NC;Department of Computer Science, North Carolina State University, Raleigh, NC;Department of Computer Science, North Carolina State University, Raleigh, NC
Venue:
ESWC'11 Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part II
Year:
2011

Abstract

Existing MapReduce systems support relational-style join operators, which translate multi-join query plans into several Map-Reduce cycles. This leads to high I/O and communication costs due to the multiple data transfer steps between map and reduce phases. SPARQL graph pattern matching is dominated by join operations and is unlikely to be processed efficiently using existing techniques: the cost is prohibitive for RDF graph pattern matching queries, which typically involve several join operations. In this paper, we propose an approach for optimizing graph pattern matching by reinterpreting certain join tree structures as grouping operations. This enables a greater degree of parallelism in join processing, resulting in more "bushy" query execution plans with fewer Map-Reduce cycles. This approach requires that intermediate results be managed as sets of groups of triples, or TripleGroups. We therefore propose a data model and algebra, the Nested TripleGroup Algebra (NTGA), for capturing and manipulating TripleGroups. The relationship with the traditional relational-style algebra used in Apache Pig is discussed. A comparative performance evaluation of the traditional Pig approach and RAPID+ (Pig extended with NTGA) for graph pattern matching queries on the BSBM benchmark dataset is presented. Results show up to 60% performance improvement of our approach over traditional Pig for some tasks.
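
The idea of treating a join structure as a grouping operation can be illustrated with a small sketch. The Python below is not the authors' NTGA or RAPID+ implementation; it is a toy, in-memory illustration that assumes a star-shaped pattern (several triple patterns sharing one subject variable). The sample triples and the function names `group_by_subject` and `match_star` are hypothetical. A relational plan would evaluate the three-pattern star with two joins, whereas here a single grouping pass on the subject collects the candidate triples and a filter keeps the groups that cover every required property.

```python
from collections import defaultdict

# Toy triples (subject, property, object); the data is made up for illustration.
TRIPLES = [
    ("product1", "label", "Widget"),
    ("product1", "price", "10"),
    ("product1", "vendor", "vendorA"),
    ("product2", "label", "Gadget"),
    ("product2", "vendor", "vendorB"),
    ("vendorA", "country", "US"),
]


def group_by_subject(triples):
    """Emulate one map/shuffle step: key every triple by its subject."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((s, p, o))
    return groups


def match_star(groups, required_props):
    """Emulate the reduce step: keep groups that cover every required property."""
    required = set(required_props)
    matches = {}
    for subject, group in groups.items():
        props = {p for _, p, _ in group}
        if required <= props:
            # Keep only the triples that belong to the star pattern.
            matches[subject] = [t for t in group if t[1] in required]
    return matches


# Star pattern: ?p label ?l . ?p price ?pr . ?p vendor ?v
star = match_star(group_by_subject(TRIPLES), ["label", "price", "vendor"])
```

In this sketch each surviving entry of `star` plays the role of a group of triples for one subject, and `product2` is dropped because it has no `price` triple. How such groups are nested, joined with other stars, and mapped onto Map-Reduce cycles is specified by the paper's algebra and is not modeled here.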