Temporal index sharding for space-time efficiency in archive search

  • Authors:
  • Avishek Anand;Srikanta Bedathur;Klaus Berberich;Ralf Schenkel

  • Affiliations:
  • Max-Planck Institute for Informatics, Saabruecken, Germany;IIIT, Delhi, India;Max-Planck Institute for Informatics, Saarbruecken, Germany;Saarland University , Saarbruecken, Germany

  • Venue:
  • Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Time-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as the web archives or wikis. Typical approaches for efficient evaluation of these queries involve slicing either the entire collection [20] or individual index lists [10] along the time-axis. Both these methods are not satisfactory since they sacrifice compactness of index for processing efficiency making them either too big or, otherwise, too slow. We present a novel index organization scheme that shards each index list with almost zero increase in index size but still minimizes the cost of reading index entries during query processing. Based on the optimal sharding thus btained, we develop a practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm merges shards from the optimal solution to allow for a few extra sequential accesses while gaining significantly by reducing the number of random accesses. We empirically establish the effectiveness of our sharding scheme with experiments over the revision history of the English Wikipedia between 2001-2005 (approx 700 GB) and an archive of U.K. governmental web sites (approx 400 GB). Our results demonstrate the feasibility of faster time-travel query processing with no space overhead.