Scalable distributed indexing and query processing over Linked Data

  • Authors: Marcel Karnstedt; Kai-Uwe Sattler; Manfred Hauswirth

  • Affiliations: Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland; Faculty of Computer Science and Automation, Ilmenau University of Technology, Germany; Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, Ireland

  • Venue: Web Semantics: Science, Services and Agents on the World Wide Web
  • Year: 2012

Abstract

Linked Data is becoming a core part of modern Web applications, and efficient access to structured information expressed in RDF is therefore of paramount importance. A number of efficient local RDF stores already exist, whereas distributed indexing and distributed query processing over Linked Data, with efficiency and data management features comparable to those of traditional database and data integration systems, are only starting to develop. Distributed approaches will necessarily co-exist with centralized schemes, as data will be owned by different stakeholders who may not want to provide their complete data sets to a central place. Additionally, central or integrated storage may be prohibited for organizational or legal reasons in certain areas. Only a few attempts to support decentralized schemes exist so far, and they are limited in terms of capabilities, the degree of distribution versus efficiency, query expressivity, and scalability. The approach and proof-of-concept prototype presented in this paper address these open challenges. Arguing for widely distributed systems as a possible answer to scalability issues, we first identify and discuss the main challenges. Based on this analysis, we propose an approach for efficient and scalable query processing over distributed Linked Data sources that takes into account the latest advances in database technology. Our system is based on a layered architecture that exploits the advantages of decentralized indexing and query processing approaches, which have been researched extensively and have matured over the last decade. Our approach is based on a logical algebra for queries over RDF data and a related physical query algebra, which together enable optimization on both the logical and the physical layer of query processing.
The introduced operators and strategies for processing complex query plans make extensive use of parallelism and other optimization paradigms of distributed query processing. Our query processing framework includes a sophisticated cost model to enable cost-efficient query planning and query execution. We evaluate our approach extensively in experiments on a real proof-of-concept deployment, which demonstrate the efficiency, applicability, and correctness of the proposed concepts.
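
To make the idea of decentralized indexing over RDF triples more concrete, the following is a minimal, purely illustrative sketch and not the system described in the paper. It assumes that each triple is indexed once per position (subject, predicate, object) and that the responsible peer is chosen by hashing the indexed term. The names `Peer`, `TripleIndex`, `insert`, and `match`, as well as the hash-modulo placement, are hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict


class Peer:
    """One node holding a fragment of the distributed index (hypothetical)."""

    def __init__(self):
        self.index = defaultdict(set)  # index key -> set of (s, p, o) triples


class TripleIndex:
    """Toy in-process stand-in for a network of peers (an assumption)."""

    def __init__(self, num_peers=4):
        self.peers = [Peer() for _ in range(num_peers)]

    def _peer_for(self, key):
        # Assumed placement rule: hash the key and take it modulo the peer count.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.peers[int.from_bytes(digest[:4], "big") % len(self.peers)]

    def insert(self, s, p, o):
        # Store the triple three times, keyed by subject, predicate, and object.
        triple = (s, p, o)
        for position, term in zip("spo", triple):
            key = position + ":" + term
            self._peer_for(key).index[key].add(triple)

    def match(self, s=None, p=None, o=None):
        # Evaluate a triple pattern; None marks an unbound position.
        pattern = (s, p, o)
        bound = [(pos, t) for pos, t in zip("spo", pattern) if t is not None]
        if bound:
            # Route to the single peer responsible for the first bound term.
            key = bound[0][0] + ":" + bound[0][1]
            candidates = self._peer_for(key).index.get(key, set())
        else:
            # No bound term: fall back to scanning every peer.
            candidates = {t for peer in self.peers
                          for triples in peer.index.values() for t in triples}
        # Filter the candidates on the remaining bound positions.
        return sorted(t for t in candidates
                      if all(q is None or q == v for q, v in zip(pattern, t)))


idx = TripleIndex(num_peers=4)
idx.insert("ex:alice", "foaf:knows", "ex:bob")
idx.insert("ex:bob", "foaf:knows", "ex:carol")
idx.insert("ex:alice", "foaf:name", '"Alice"')
print(idx.match(p="foaf:knows"))
```

In this sketch a pattern with at least one bound term contacts exactly one peer, while a fully unbound pattern has to visit all of them. A cost model of the kind mentioned in the abstract would weigh exactly such differences when choosing among query plans.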