Efficient indexing of versioned document sequences

Authors:
Michael Herscovici;Ronny Lempel;Sivan Yogev
Affiliations:
Google Inc., Haifa, Israel;IBM Haifa Research Lab, Israel;IBM Haifa Research Lab, Israel
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007

References (12):
Versioning a full-text information retrieval system. SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Query evaluation: strategies and optimizations. Information Processing and Management: an International Journal
String editing and longest common subsequences. Handbook of formal languages, vol. 2
Syntactic clustering of the Web. Selected papers from the sixth international conference on World Wide Web
Building a distributed full-text index for the Web. Proceedings of the 10th international conference on World Wide Web
Searching the Web. ACM Transactions on Internet Technology (TOIT)
Modern Information Retrieval
Database System Implementation
Computers and Intractability: A Guide to the Theory of NP-Completeness
Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology
Compressing and searching XML data via two zips. Proceedings of the 15th international conference on World Wide Web
Indexing shared content in information retrieval systems. EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology

Cited by (14):
Efficient search in large textual collections with redundancy. Proceedings of the 16th international conference on World Wide Web
A time machine for text search. SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
FluxCapacitor: efficient time-travel text search. VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Inverted index compression and query processing with optimized document ordering. Proceedings of the 18th international conference on World wide web
Optimizing complex extraction programs over evolving text data. Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Compact full-text indexing of versioned document collections. Proceedings of the 18th ACM conference on Information and knowledge management
Leveraging temporal dynamics of document content in relevance ranking. Proceedings of the third ACM international conference on Web search and data mining
Durable top-k search in document archives. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Improved index compression techniques for versioned document collections. CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Hybrid index structures for temporal-textual web search. APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Temporal index sharding for space-time efficiency in archive search. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Faster temporal range queries over versioned text. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Index maintenance for time-travel text search. SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Optimizing positional index structures for versioned document collections. SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Abstract

Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g., ClearCase, CVS), wikis, and backup and archiving solutions. It is often desirable to support free-text search over such repositories, i.e., to let users submit queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% compared with standard indexing, while supporting the same retrieval capabilities.
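To make the idea of exploiting redundancy across versions concrete, the following Python sketch builds a toy index in which a token shared by consecutive versions is stored once, together with the range of versions that contain it. This is only an illustration under simplifying assumptions and is not the authors' algorithm: it aligns each version with its immediate predecessor using the standard library's `difflib.SequenceMatcher` instead of solving a multiple sequence alignment variant, it ignores token positions, and the function names `build_aligned_index`, `versions_matching`, and `posting_count` are hypothetical.

```python
from difflib import SequenceMatcher


def build_aligned_index(versions):
    """Index token lists (oldest version first) with one posting per aligned token.

    Returns {term: [[first_version, last_version], ...]}. Assumption for this
    sketch: each version is aligned only with the version right before it.
    """
    index = {}
    prev_tokens = []
    prev_postings = []  # posting object behind each token of the previous version
    for v, tokens in enumerate(versions):
        cur_postings = [None] * len(tokens)
        matcher = SequenceMatcher(a=prev_tokens, b=tokens, autojunk=False)
        for a, b, size in matcher.get_matching_blocks():
            for k in range(size):
                # The token survived from the previous version: extend its range.
                posting = prev_postings[a + k]
                posting[1] = v
                cur_postings[b + k] = posting
        for i, term in enumerate(tokens):
            if cur_postings[i] is None:
                # The token is new in this version: open a fresh posting.
                posting = [v, v]
                index.setdefault(term, []).append(posting)
                cur_postings[i] = posting
        prev_tokens, prev_postings = tokens, cur_postings
    return index


def versions_matching(index, term):
    """Return the sorted version numbers in which `term` occurs."""
    hits = set()
    for first, last in index.get(term, []):
        hits.update(range(first, last + 1))
    return sorted(hits)


def posting_count(index):
    """Total number of postings stored in the index."""
    return sum(len(postings) for postings in index.values())


demo_versions = [
    "the quick brown fox".split(),
    "the quick red fox".split(),
    "the quick red fox jumps".split(),
]
demo_index = build_aligned_index(demo_versions)
print(posting_count(demo_index), "postings instead of",
      sum(len(tokens) for tokens in demo_versions))
print(versions_matching(demo_index, "red"))
```

On this toy input, indexing each version independently would store 13 postings (one per token occurrence), whereas the aligned index stores 6, and a query for any term still resolves to the exact set of versions containing it. The figures apply to this example only and say nothing about the compaction ratios reported in the abstract.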