To nest or not to nest, when and how much: representing intermediate results of graph pattern queries in MapReduce based processing

  • Authors: Padmashree Ravindra; HyeongSik Kim; Kemafor Anyanwu
  • Affiliations: North Carolina State University; North Carolina State University; North Carolina State University
  • Venue: SWIM '12 Proceedings of the 4th International Workshop on Semantic Web Information Management
  • Year: 2012

Abstract

Many queries on RDF datasets involve triple patterns whose properties are multi-valued. When such queries are processed using flat data models and their associated algebras, intermediate results can contain substantial redundancy. On MapReduce-based platforms such as Hadoop, this redundancy can account for a non-trivial proportion of overall disk I/O, sorting, and network data transfer costs. Further, when MapReduce workflows consist of multiple cycles, as is typical when processing RDF graph pattern queries, these costs can compound across cycles. It may be possible to avoid such overhead by using nested data models and algebras. In this short paper, we present some ongoing research into the use of a nested TripleGroup data model and Algebra (NTGA) for MapReduce-based RDF graph processing. The NTGA operators fully subscribe to the NTG data model, in contrast to systems such as Pig, where the data model supports some nesting but the algebra is primarily tuple-based (requiring nested objects to be flattened before other operators can be applied). This full subscription to the nested data model also enables NTGA to support different unnesting strategies, including delayed and partial unnesting. We present a preliminary evaluation of these strategies for efficiently managing multi-valued properties while processing graph pattern queries in Apache Pig.
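
The redundancy argument in the abstract can be illustrated with a minimal Python sketch. This is not the NTGA implementation or its operators: the toy triples, the function names `group_by_subject` and `unnest`, and the value-count comparison are all assumptions introduced here for illustration. The sketch groups triples by subject into a nested structure, then flattens it into tuples the way a flat algebra would, and compares how many values each representation carries. Unnesting only a subset of properties loosely mirrors the idea of partial unnesting; the paper's precise definitions may differ.

```python
from itertools import product

# Hypothetical toy data: one subject with two multi-valued properties.
triples = [
    ("paper1", "title", "T"),
    ("paper1", "author", "A"),
    ("paper1", "author", "B"),
    ("paper1", "author", "C"),
    ("paper1", "keyword", "rdf"),
    ("paper1", "keyword", "mapreduce"),
]


def group_by_subject(triples):
    """Nested representation: subject -> property -> list of values."""
    groups = {}
    for s, p, o in triples:
        groups.setdefault(s, {}).setdefault(p, []).append(o)
    return groups


def unnest(group, props):
    """Flatten the chosen properties of one group into flat tuples.

    Each multi-valued property multiplies the number of rows, which is
    where the redundancy of a flat representation comes from.
    """
    value_lists = [group[p] for p in props]
    return [dict(zip(props, combo)) for combo in product(*value_lists)]


nested = group_by_subject(triples)

# Full unnesting: 1 title x 3 authors x 2 keywords = 6 rows.
flat = unnest(nested["paper1"], ["title", "author", "keyword"])

# Partial unnesting (illustrative): only the properties needed next.
partial = unnest(nested["paper1"], ["title", "author"])

# Count how many property values each representation stores.
nested_values = sum(len(v) for v in nested["paper1"].values())
flat_values = sum(len(row) for row in flat)

print(len(flat), len(partial), nested_values, flat_values)
```

In this toy case the nested group stores each value once, while the fully flattened form repeats the title and every author or keyword across rows. In a multi-cycle MapReduce workflow, that repeated data would be written, sorted, and shuffled in each cycle, which is the cost the abstract attributes to flat intermediate results.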