Towards scalable RDF graph analytics on MapReduce

Authors:
Padmashree Ravindra;Vikas V. Deshpande;Kemafor Anyanwu
Affiliations:
North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC
Venue:
Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
Year:
2010

Citing 13
Cited 5

Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Scalable semantic web data management using vertical partitioning

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Hexastore: sextuple indexing for semantic web data management

Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Scalable Semantics - The Silver Lining of Cloud Computing

ESCIENCE '08 Proceedings of the 2008 Fourth IEEE International Conference on eScience
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Scalable join processing on very large RDF graphs

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Optimizing joins in a map-reduce environment

Proceedings of the 13th International Conference on Extending Database Technology
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

PigSPARQL: mapping SPARQL to Pig Latin

Proceedings of the International Workshop on Semantic Web Information Management
An intermediate algebra for optimizing RDF graph pattern matching on MapReduce

ESWC'11 Proceedings of the 8th extended semantic web conference on The semanic web: research and applications - Volume Part II
TripleCloud: An Infrastructure for Exploratory Querying over Web-Scale RDF Data

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 03
HadoopRDF: a scalable semantic data analytical engine

ICIC'12 Proceedings of the 8th international conference on Intelligent Computing Theories and Applications
Robust runtime optimization and skew-resistant execution of analytical SPARQL queries on pig

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

In order to exploit the growing amount of RDF data in decision-making, there is an increasing demand for analytics-style processing of such data. RDF data is modeled as a labeled graph that represents a collection of binary relations (triples). In this context, analytical queries can be interpreted as consisting of three main constructs namely pattern matching, grouping and aggregation, and require several join operations to reassemble them into n-ary relations relevant to the given query, unlike traditional OLAP systems where data is suitably organized. MapReduce-based parallel processing systems like Pig have gained success in processing scalable analytical workloads. However, these systems offer only relational algebra style operators which would require an iterative n-tuple reassembly process in which intermediate results need to be materialized. This leads to high I/O costs that negatively impacts performance. In this paper, we propose UDFs that (i) re-factor analytical processing on RDF graphs in a way that enables more parallelized processing (ii) perform a look-ahead processing to reduce the cost of subsequent operators in the query execution plan. These functions have been integrated into the Pig Latin function library and the experimental results show up to 50% improvement in execution times for certain classes of queries. An important impact of this work is that it could serve as the foundation for additional physical operators in systems such as Pig for more efficient graph processing.