Compact full-text indexing of versioned document collections

Authors:
Jinru He; Hao Yan; Torsten Suel
Affiliations:
Polytechnic Institute of NYU, Brooklyn, NY, USA;Polytechnic Institute of NYU, Brooklyn, NY, USA;Polytechnic Institute of NYU, Brooklyn, NY, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important examples are Wikipedia and the web page archive maintained by the Internet Archive. A straightforward indexing approach would simply treat each document version as a separate document, so that index size scales linearly with the number of versions. However, several authors have recently studied approaches that exploit the significant similarities between different versions of the same document to obtain much smaller index sizes. In this paper, we propose new techniques for organizing and compressing inverted index structures for such collections. We also perform a detailed experimental comparison of the new techniques with existing techniques from the literature. Our results on an archive of the English version of Wikipedia, and on a subset of the Internet Archive collection, show significant benefits over previous approaches.
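To make the size argument concrete, the sketch below contrasts the straightforward approach (one posting per term per version) with one simple way of exploiting similarity between versions: a single posting per term per document, plus a bitmask recording which versions contain the term. This is an illustration under stated assumptions, not a reproduction of the techniques proposed in the paper; the toy collection and the names `naive_index`, `two_level_index`, and `count_postings` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> list of versions (oldest first).
collection = {
    0: ["the quick brown fox",
        "the quick brown fox jumps",
        "the quick red fox jumps"],
    1: ["hello world",
        "hello brave world"],
}

def naive_index(collection):
    """Treat every version as a separate document.

    Returns term -> list of global version ids, one posting per
    (term, version) pair, so size grows with the number of versions.
    """
    index = defaultdict(list)
    version_id = 0
    for doc_id in sorted(collection):
        for text in collection[doc_id]:
            for term in set(text.split()):
                index[term].append(version_id)
            version_id += 1
    return index

def two_level_index(collection):
    """Store one posting per (term, document).

    Returns term -> {doc_id: bitmask}, where bit v of the mask is set
    if version v of that document contains the term. (Illustrative
    assumption, not the paper's exact layout.)
    """
    index = defaultdict(dict)
    for doc_id in sorted(collection):
        for v, text in enumerate(collection[doc_id]):
            for term in set(text.split()):
                index[term][doc_id] = index[term].get(doc_id, 0) | (1 << v)
    return index

def count_postings(index):
    """Total number of postings across all terms."""
    return sum(len(postings) for postings in index.values())

naive = naive_index(collection)
compact = two_level_index(collection)
print(count_postings(naive))    # 19 postings: one per term per version
print(count_postings(compact))  # 9 postings: one per term per document
print(bin(compact["brown"][0])) # 0b11: "brown" occurs in versions 0 and 1 of document 0
```

In this toy example the posting count drops from 19 to 9 because most terms survive from one version to the next. Each remaining posting carries a version bitmask, so the real saving depends on how well those masks compress.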