Towards scalable RDF graph analytics on MapReduce

  • Authors:
  • Padmashree Ravindra;Vikas V. Deshpande;Kemafor Anyanwu

  • Affiliations:
  • North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC

  • Venue:
  • Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
  • Year:
  • 2010

Abstract

To exploit the growing amount of RDF data in decision-making, there is an increasing demand for analytics-style processing of such data. RDF data is modeled as a labeled graph that represents a collection of binary relations (triples). In this context, analytical queries can be interpreted as consisting of three main constructs, namely pattern matching, grouping, and aggregation. Unlike traditional OLAP systems, where data is suitably organized, such queries require several join operations to reassemble the triples into n-ary relations relevant to the given query. MapReduce-based parallel processing systems like Pig have gained success in processing analytical workloads at scale. However, these systems offer only relational-algebra-style operators, which would require an iterative n-tuple reassembly process in which intermediate results need to be materialized. This leads to high I/O costs that negatively impact performance. In this paper, we propose user-defined functions (UDFs) that (i) refactor analytical processing on RDF graphs in a way that enables more parallelized processing, and (ii) perform look-ahead processing to reduce the cost of subsequent operators in the query execution plan. These functions have been integrated into the Pig Latin function library, and the experimental results show up to 50% improvement in execution times for certain classes of queries. An important impact of this work is that it could serve as the foundation for additional physical operators in systems such as Pig for more efficient graph processing.
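
To make the three constructs concrete, the following is a minimal single-machine Python sketch, not the paper's UDFs and not Pig Latin. It shows how binary relations (triples) are joined on their subject to reassemble n-ary tuples, which are then grouped and aggregated. The data, the property names, and the helpers `select`, `match_star`, and `average_by` are hypothetical, and the sketch assumes each subject has at most one value per property.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, property, object) triples.
TRIPLES = [
    ("prod1", "type", "Product"),
    ("prod1", "vendor", "v1"),
    ("prod1", "price", 10),
    ("prod2", "type", "Product"),
    ("prod2", "vendor", "v1"),
    ("prod2", "price", 30),
    ("prod3", "type", "Product"),
    ("prod3", "vendor", "v2"),
    ("prod3", "price", 25),
    ("v1", "type", "Vendor"),
]


def select(triples, prop):
    """Return one binary relation as {subject: object} for a single property.

    Assumption: properties are single-valued per subject.
    """
    return {s: o for s, p, o in triples if p == prop}


def match_star(triples, props):
    """Pattern matching: join the binary relations for `props` on subject.

    Each loop iteration is one join. With relational-style operators on
    MapReduce, each such join would typically produce an intermediate
    result that has to be materialized before the next step.
    """
    relations = [select(triples, p) for p in props]
    rows = {s: (o,) for s, o in relations[0].items()}
    for rel in relations[1:]:
        rows = {s: vals + (rel[s],) for s, vals in rows.items() if s in rel}
    return rows


def average_by(rows, key_idx, val_idx):
    """Grouping and aggregation: average the value column per group key."""
    groups = defaultdict(list)
    for vals in rows.values():
        groups[vals[key_idx]].append(vals[val_idx])
    return {k: sum(v) / len(v) for k, v in groups.items()}


# Reassemble (type, vendor, price) tuples, then average price per vendor.
rows = match_star(TRIPLES, ["type", "vendor", "price"])
avg_price = average_by(rows, key_idx=1, val_idx=2)
print(avg_price)
```

In this sketch a three-property pattern already needs two joins before any grouping can start, which is the iterative reassembly cost the abstract describes.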