Index maintenance for time-travel text search

Authors:
Avishek Anand;Srikanta Bedathur;Klaus Berberich;Ralf Schenkel
Affiliations:
Max-Planck Institute for Informatics, Saarbruecken, Germany;IIIT-Delhi, New Delhi, India;Max-Planck Institute for Informatics, Saarbruecken, Germany;Saarland University, Saarbruecken, Germany
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Abstract

Time-travel text search enriches standard text search with temporal predicates, so that users of web archives can easily retrieve document versions that are considered relevant to a given keyword query and existed during a given time interval. Several index structures have been proposed to support time-travel text search efficiently. None of them, however, can easily be updated as the Web evolves and new document versions are added to the web archive. In this work, we describe a novel index structure that efficiently supports time-travel text search and can be maintained incrementally as new document versions are added to the web archive. Our solution uses a sharded index organization, bounds the number of spuriously read index entries per shard, and can be maintained using small in-memory buffers and append-only operations. We present experiments on two large-scale real-world datasets demonstrating that maintaining our index structure is an order of magnitude more efficient than periodically rebuilding one of the existing index structures, while query-processing performance is not adversely affected.
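
To make the query semantics concrete, the following is a minimal sketch of a time-travel lookup over an in-memory inverted index. It is not the index structure proposed in the paper: it has no sharding, no bound on spuriously read entries, and no on-disk layout. It only illustrates what it means to retrieve versions that contain a keyword and existed during a query interval. All names (Posting, TimeTravelIndex, add_version, query) and the half-open interval convention [begin, end) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    """One index entry: a document version and its validity interval."""
    doc_id: str
    begin: int  # first timestamp at which the version existed (inclusive)
    end: int    # timestamp at which it was superseded (exclusive)


class TimeTravelIndex:
    """Toy inverted index whose postings carry validity intervals.

    Hypothetical illustration only; new versions are appended to the
    end of each term's list and nothing is ever rewritten in place.
    """

    def __init__(self):
        self._lists = defaultdict(list)

    def add_version(self, doc_id, text, begin, end):
        # Append one posting per distinct term of the new version.
        for term in set(text.lower().split()):
            self._lists[term].append(Posting(doc_id, begin, end))

    def query(self, term, q_begin, q_end):
        # Keep postings whose interval overlaps [q_begin, q_end).
        # Every posting that fails this test was read spuriously.
        return [
            p for p in self._lists.get(term.lower(), [])
            if p.begin < q_end and q_begin < p.end
        ]


index = TimeTravelIndex()
index.add_version("d1", "web archive search", 0, 10)
index.add_version("d1", "web archive index", 10, 20)
index.add_version("d2", "archive", 5, 15)
print(index.query("archive", 12, 13))
```

In this sketch every query scans the full posting list of the term, so the number of entries read without contributing to the result grows with the length of the archived history. Limiting that wasted work is the kind of cost the abstract refers to when it mentions bounding spuriously read index entries per shard.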