Efficient indexing of versioned document sequences

  • Authors:
  • Michael Herscovici;Ronny Lempel;Sivan Yogev

  • Affiliations:
  • Google Inc., Haifa, Israel;IBM Haifa Research Lab, Israel;IBM Haifa Research Lab, Israel

  • Venue:
  • ECIR'07 Proceedings of the 29th European conference on IR research
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g. ClearCase, CVS), Wikis, and backup and archiving solutions. Often, it is desired to enable free-text search over such repositories, i.e. to enable submitting queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% as compared with standard indexing, while supporting the same retrieval capabilities.