Optimizing positional index structures for versioned document collections

Authors:
Jinru He;Torsten Suel
Affiliations:
Polytechnic Institute of NYU, Brooklyn, NY, USA;Polytechnic Institute of NYU, Brooklyn, NY, USA
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Citing 30

Versioning a full-text information retrieval system
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval

Delta algorithms: an empirical analysis
ACM Transactions on Software Engineering and Methodology (TOSEM)

A protocol-independent technique for eliminating redundant network traffic
Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication

A low-bandwidth network file system
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles

Binary Interpolative Coding for Effective Index Compression
Information Retrieval

Dynamic maintenance of web indexes using landmarks
WWW '03 Proceedings of the 12th international conference on World Wide Web

Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

The webgraph framework I: compression techniques
Proceedings of the 13th international conference on World Wide Web

Efficient randomized pattern-matching algorithms
IBM Journal of Research and Development - Mathematics and computing

Hierarchical substring caching for efficient content distribution to low-bandwidth clients
WWW '05 Proceedings of the 14th international conference on World Wide Web

A Markov random field model for term dependencies
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

Super-Scalar RAM-CPU Cache Compression
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering

Inverted files for text search engines
ACM Computing Surveys (CSUR)

Efficient search in large textual collections with redundancy
Proceedings of the 16th international conference on World Wide Web

Redundancy elimination within large collections of files
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference

A time machine for text search
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

Incremental cluster-based retrieval using compressed cluster-skipping inverted files
ACM Transactions on Information Systems (TOIS)

The Minimum Substring Cover problem
Information and Computation

Challenges in building large-scale information retrieval systems: invited talk
Proceedings of the Second ACM International Conference on Web Search and Data Mining

Inverted index compression and query processing with optimized document ordering
Proceedings of the 18th international conference on World wide web

EverLast: a distributed architecture for preserving the web
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries

Compact full-text indexing of versioned document collections
Proceedings of the 18th ACM conference on Information and knowledge management

Term proximity scoring for keyword-based retrieval systems
ECIR'03 Proceedings of the 25th European conference on IR research

Efficient indexing of versioned document sequences
ECIR'07 Proceedings of the 29th European conference on IR research

Durable top-k search in document archives
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Efficient temporal keyword search over versioned text
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Improved index compression techniques for versioned document collections
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Faster temporal range queries over versioned text
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Indexes for highly repetitive document collections
Proceedings of the 20th ACM international conference on Information and knowledge management

Indexing shared content in information retrieval systems
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology

Abstract

Versioned document collections contain multiple versions of each document. Important examples are web archives, Wikipedia and other wikis, and source code and documents maintained in revision control systems. Such collections can become very large because past versions must be retained, but the versions of a document also share a great deal of redundant content that can be exploited. Thus, versioned document collections are usually stored using special differential (delta) compression techniques, and a number of researchers have recently studied how to exploit this redundancy to obtain more succinct full-text index structures. In this paper, we study index organization and compression techniques for such versioned full-text index structures. In particular, we focus on positional index structures, whereas most previous work has focused on the non-positional case. Building on earlier work in [zs:redun], we propose a framework for indexing and querying versioned document collections that integrates non-positional and positional indexes to enable fast top-k query processing. Within this framework, we define and study the problem of minimizing positional index size through optimal substring partitioning. Experiments on Wikipedia and web archive data show that our techniques achieve significant reductions in index size over previous work while supporting very fast query processing.