Faster temporal range queries over versioned text

Authors:
Jinru He; Torsten Suel
Affiliations:
Polytechnic Institute of New York University, Brooklyn, NY, USA; Polytechnic Institute of New York University, Brooklyn, NY, USA
Venue:
Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
Year:
2011

Abstract

Versioned textual collections are collections that retain multiple versions of a document as it evolves over time. Important large-scale examples are Wikipedia and the web collection of the Internet Archive. Search queries over such collections often use keywords as well as temporal constraints, most commonly a time range of interest. In this paper, we study how to support such temporal range queries over versioned text. Our goal is to process these queries faster than the corresponding keyword-only queries, by exploiting the additional constraint. A simple approach might partition the index into different time ranges, and then access only the relevant parts. However, specialized inverted index compression techniques are crucial for large versioned collections, and a naive partitioning can negatively affect index size and query throughput. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on over 85 million versions of Wikipedia articles show that queries can be executed in a few milliseconds on memory-based index structures, and only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes.
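
To make the "simple approach" mentioned in the abstract concrete, the following is a minimal sketch of an inverted index partitioned by time range, where a temporal range query touches only the partitions that overlap the query interval. It is not the authors' partitioning scheme and it ignores compression entirely. The class name `TimePartitionedIndex`, its methods, the integer timestamps, and the half-open validity intervals `[start, end)` are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


class TimePartitionedIndex:
    """Hypothetical naive time-partitioned inverted index (illustration only)."""

    def __init__(self, boundaries):
        # boundaries[i] is the integer start time of partition i; partition i
        # covers [boundaries[i], boundaries[i + 1]), and the last one is open-ended.
        self.boundaries = sorted(boundaries)
        # One term -> set of (doc_id, start, end) postings map per partition.
        self.partitions = [defaultdict(set) for _ in self.boundaries]

    def _partition_of(self, t):
        # Index of the partition whose time range contains timestamp t.
        i = bisect_right(self.boundaries, t) - 1
        if i < 0:
            raise ValueError("timestamp precedes the first partition")
        return i

    def add_version(self, doc_id, start, end, text):
        # A version is valid during [start, end). It is replicated into every
        # partition it overlaps, which is where the naive scheme costs space.
        first = self._partition_of(start)
        last = self._partition_of(end - 1)
        terms = set(text.lower().split())
        for p in range(first, last + 1):
            for term in terms:
                self.partitions[p][term].add((doc_id, start, end))

    def query(self, terms, q_start, q_end):
        # Conjunctive keyword query restricted to the time range [q_start, q_end).
        first = self._partition_of(q_start)
        last = self._partition_of(q_end - 1)
        results = set()
        for p in range(first, last + 1):
            part = self.partitions[p]
            postings = [part.get(t, set()) for t in terms]
            if not postings:
                continue
            for doc_id, s, e in set.intersection(*postings):
                # Keep only versions whose validity interval overlaps the query.
                if s < q_end and e > q_start:
                    results.add((doc_id, s, e))
        return sorted(results)


idx = TimePartitionedIndex([0, 10, 20])
idx.add_version("a", 0, 5, "inverted index compression")
idx.add_version("a", 5, 25, "inverted index partitioning")
idx.add_version("b", 12, 18, "index compression")
print(idx.query(["index", "compression"], 11, 20))
```

In this sketch the second version of document "a" spans all three partitions and is therefore stored three times. That replication is one way a naive partitioning can inflate index size, which is the concern the abstract raises about ignoring index compression.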