Document identifier reassignment through dimensionality reduction

  • Authors:
  • Roi Blanco;Álvaro Barreiro

  • Affiliations:
  • AILab. Computer Science Department, University of Corunna, Spain;AILab. Computer Science Department, University of Corunna, Spain

  • Venue:
  • ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
  • Year:
  • 2005

Abstract

Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent work has demonstrated that IF sizes can be reduced by reassigning the document identifiers of the original collection, because this lowers the average distance between the documents related to a single term. Variable-bit encoding schemes can exploit the reduced average gap and decrease the total number of bits per document pointer. However, the approaches developed so far either require large amounts of time or use an uncontrolled amount of memory. This paper presents an efficient solution to the reassignment problem that consists of reducing the dimensionality of the input data using an SVD transformation. We tested this approach with the Greedy-NN TSP algorithm and with a more efficient variant based on dividing the original problem into sub-problems. We present experimental tests and performance results on two TREC collections, obtaining good compression ratios with low running times. We also show experimental results on how the degree of dimensionality reduction trades off against compression and time performance.
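The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy binary document-by-term matrix, Elias gamma coding as the variable-bit scheme, Euclidean distance in the reduced space as the dissimilarity measure, and a plain greedy nearest-neighbour walk. The names `gamma_bits`, `index_bits`, `reduce_docs`, and `greedy_nn_order` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def gamma_bits(gap):
    """Bits used by the Elias gamma code for a positive integer gap."""
    return 2 * int(gap).bit_length() - 1

def index_bits(matrix, order):
    """Total gamma-coded size of all posting lists under a document ordering.

    matrix is a binary document-by-term array; order[i] is the original
    row that receives the new identifier i + 1.
    """
    permuted = matrix[order]
    total = 0
    for term in range(permuted.shape[1]):
        ids = np.flatnonzero(permuted[:, term]) + 1  # 1-based identifiers
        gaps = np.diff(ids, prepend=0)  # first gap is the identifier itself
        total += sum(gamma_bits(g) for g in gaps)
    return total

def reduce_docs(matrix, k):
    """Project documents onto the top-k singular directions (truncated SVD)."""
    u, s, _ = np.linalg.svd(matrix.astype(float), full_matrices=False)
    return u[:, :k] * s[:k]

def greedy_nn_order(points, start=0):
    """Greedy nearest-neighbour tour: always move to the closest unvisited point.

    Assumption: Euclidean distance stands in for the similarity measure.
    """
    n = len(points)
    unvisited = np.ones(n, dtype=bool)
    order = [start]
    unvisited[start] = False
    for _ in range(n - 1):
        dist = np.linalg.norm(points - points[order[-1]], axis=1)
        dist[~unvisited] = np.inf  # never revisit a document
        nxt = int(np.argmin(dist))
        order.append(nxt)
        unvisited[nxt] = False
    return np.array(order)

# Toy collection: two topics whose documents are interleaved in the
# original numbering (even rows use terms 0-2, odd rows use terms 3-5).
docs = np.zeros((8, 6), dtype=int)
docs[0::2, :3] = 1
docs[1::2, 3:] = 1

original_bits = index_bits(docs, np.arange(len(docs)))
order = greedy_nn_order(reduce_docs(docs, k=2))
reordered_bits = index_bits(docs, order)
print(original_bits, reordered_bits)
```

On this toy collection the walk places the documents of each topic next to each other, so most gaps shrink from 2 to 1 and the gamma-coded size drops from 66 to 36 bits. Real collections are far larger and sparser, and the choice of `k` then governs the trade-off between running time and compression that the abstract mentions.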