To nest or not to nest, when and how much: representing intermediate results of graph pattern queries in MapReduce based processing

Authors:
Padmashree Ravindra;HyeongSik Kim;Kemafor Anyanwu
Affiliations:
North Carolina State University;North Carolina State University;North Carolina State University
Venue:
SWIM '12 Proceedings of the 4th International Workshop on Semantic Web Information Management
Year:
2012

Abstract

Many queries on RDF datasets involve triple patterns whose properties are multi-valued. When such queries are processed using flat data models and their associated algebras, intermediate results can contain considerable redundancy. On MapReduce-based platforms such as Hadoop, this redundancy can account for a non-trivial proportion of overall disk I/O, sorting, and network data transfer costs. Further, when MapReduce workflows consist of multiple cycles, as is typical when processing RDF graph pattern queries, these costs can compound from one cycle to the next. It may be possible to avoid this overhead by using nested data models and algebras. In this short paper, we present ongoing research into the use of a Nested TripleGroup Data Model and Algebra (NTGA) for MapReduce-based RDF graph processing. The NTGA operators fully subscribe to the nested TripleGroup data model. This is in contrast to systems such as Pig, where the data model supports some nesting but the algebra is primarily tuple-based, requiring nested objects to be flattened before other operators can be applied. NTGA's full subscription to the nested data model also enables support for different unnesting strategies, including delayed and partial unnesting. We present a preliminary evaluation of these strategies for the efficient management of multi-valued properties while processing graph pattern queries in Apache Pig.
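The redundancy argument in the abstract can be illustrated with a small sketch. It is not the NTGA implementation and does not use Pig or Hadoop; the triples, property names, and the helpers `group_by_subject`, `flatten`, and `count_values` are all hypothetical, chosen only to compare a fully flattened intermediate result with a nested one and with a partially unnested one.

```python
from itertools import product

# Hypothetical RDF triples: one subject with two multi-valued properties.
triples = [
    ("gene9", "label", "retinoid X receptor"),
    ("gene9", "xGO", "go1"),
    ("gene9", "xGO", "go2"),
    ("gene9", "xGO", "go3"),
    ("gene9", "synonym", "s1"),
    ("gene9", "synonym", "s2"),
]


def group_by_subject(triples):
    """Build a nested view: subject -> property -> list of object values."""
    groups = {}
    for s, p, o in triples:
        groups.setdefault(s, {}).setdefault(p, []).append(o)
    return groups


def flatten(subject, props, unnest=None):
    """Unnest the listed properties; keep the others nested as tuples.

    unnest=None unnests every property (the flat, tuple-based result);
    unnest=[] keeps everything nested (unnesting is delayed).
    """
    names = list(props)
    unnest = set(names) if unnest is None else set(unnest)
    choices = [props[p] if p in unnest else [tuple(props[p])] for p in names]
    return [(subject,) + combo for combo in product(*choices)]


def count_values(rows):
    """Count atomic values in the rows, as a rough proxy for data volume."""
    total = 0
    for row in rows:
        for field in row:
            total += len(field) if isinstance(field, tuple) else 1
    return total


props = group_by_subject(triples)["gene9"]
flat = flatten("gene9", props)                    # every property unnested
nested = flatten("gene9", props, unnest=[])       # nothing unnested
partial = flatten("gene9", props, unnest=["xGO"]) # only xGO unnested

print(len(flat), count_values(flat))
print(len(nested), count_values(nested))
print(len(partial), count_values(partial))
```

In this toy case the flat result repeats the label and each synonym once per `xGO` value, while the nested form stores each value once. Partial unnesting sits in between, which is the kind of trade-off the unnesting strategies mentioned above are meant to explore.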