Temporal index sharding for space-time efficiency in archive search

Authors:
Avishek Anand;Srikanta Bedathur;Klaus Berberich;Ralf Schenkel
Affiliations:
Max-Planck Institute for Informatics, Saarbruecken, Germany;IIIT, Delhi, India;Max-Planck Institute for Informatics, Saarbruecken, Germany;Saarland University, Saarbruecken, Germany
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

References (17)

Versioning a full-text information retrieval system
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval

Wave-indices: indexing evolving databases
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data

The LHAM log-structured history data access method
The VLDB Journal — The International Journal on Very Large Data Bases

An asymptotically optimal multiversion B-tree
The VLDB Journal — The International Journal on Very Large Data Bases

A large-scale study of the evolution of web pages
WWW '03 Proceedings of the 12th international conference on World Wide Web

What's new on the web?: the evolution of the web from a search engine perspective
Proceedings of the 13th international conference on World Wide Web

Inverted files for text search engines
ACM Computing Surveys (CSUR)

Rate of change and other metrics: a live study of the world wide web
USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems

A time machine for text search
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

On the value of temporal information in information retrieval
ACM SIGIR Forum

Introduction to Information Retrieval

The web changes everything: understanding the dynamics of web content
Proceedings of the Second ACM International Conference on Web Search and Data Mining

Compact full-text indexing of versioned document collections
Proceedings of the 18th ACM conference on Information and knowledge management

Efficient indexing of versioned document sequences
ECIR'07 Proceedings of the 29th European conference on IR research

Information Retrieval: Implementing and Evaluating Search Engines

Efficient temporal keyword search over versioned text
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Improved index compression techniques for versioned document collections
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Cited by (1)

Index maintenance for time-travel text search
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Abstract

Time-travel queries, which couple temporal constraints with keyword queries, are useful for searching large-scale archives of time-evolving content such as web archives or wikis. Typical approaches to evaluating these queries efficiently slice either the entire collection [20] or individual index lists [10] along the time axis. Neither method is satisfactory: each trades index compactness against processing efficiency, leaving the index either too big or too slow. We present a novel index organization scheme that shards each index list with almost zero increase in index size while still minimizing the cost of reading index entries during query processing. Based on the optimal sharding thus obtained, we develop a practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm merges shards from the optimal solution, allowing a few extra sequential accesses in exchange for a significant reduction in the number of random accesses. We empirically establish the effectiveness of our sharding scheme with experiments over the revision history of the English Wikipedia between 2001 and 2005 (approx. 700 GB) and an archive of U.K. governmental web sites (approx. 400 GB). Our results demonstrate the feasibility of faster time-travel query processing with no space overhead.
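
The abstract does not spell out how shards are formed, so the following is only a minimal sketch under stated assumptions. It assumes each posting carries a validity interval [begin, end) and that a shard is well formed when both its begin times and its end times are non-decreasing. Under that assumption, a time-point query touches one contiguous run of postings per shard and reads no irrelevant entries. The names shard_postings, query_shard and time_travel are hypothetical, and the greedy split is an illustration, not necessarily the authors' algorithm.

```python
from bisect import bisect_right

# Assumption: a posting is (begin, end, doc_id), valid in the
# half-open interval [begin, end).

def shard_postings(postings):
    """Greedily split one index list into shards in which begin and
    end times are both non-decreasing (illustrative rule only)."""
    shards = []
    for p in sorted(postings):              # order by begin time
        for shard in shards:
            if shard[-1][1] <= p[1]:        # end times stay non-decreasing
                shard.append(p)
                break
        else:
            shards.append([p])              # no shard fits: open a new one
    return shards

def query_shard(shard, t):
    """Return the postings of one shard that are valid at time t."""
    begins = [b for b, _, _ in shard]
    ends = [e for _, e, _ in shard]
    lo = bisect_right(ends, t)              # first posting with end > t
    hi = bisect_right(begins, t)            # one past last posting with begin <= t
    return shard[lo:hi]                     # contiguous run, nothing wasted

def time_travel(shards, t):
    """Evaluate a time-point query over all shards of an index list."""
    hits = []
    for shard in shards:                    # one seek per shard
        hits.extend(query_shard(shard, t))
    return hits
```

In this sketch every shard that is opened costs one random access, so a split into many small shards wastes no sequential reads but pays for many seeks. The merging step described in the abstract, which accepts a few extra sequential reads to save random accesses, is not shown here.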