A Method for Calculating Term Similarity on Large Document Collections

  • Authors:
  • Wolfgang W. Bein;Jeffrey S. Coombs;Kazem Taghva

  • Venue:
  • ITCC '03 Proceedings of the International Conference on Information Technology: Computers and Communications
  • Year:
  • 2003

Abstract

We present an efficient algorithm called the Quadtree Heuristic for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the Expected Mutual Information Measure (EMIM). Since our aim for defining the similarity lists is to improve information retrieval (IR), we present the outcome of an experiment comparing the performance of an IR engine designed to use the similarity lists. Two methods were used to generate similarity lists: a brute-force technique and the Quadtree Heuristic. The performance of the list generated by the Quadtree Heuristic was commensurate with that of the brute-force list.
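
To make the brute-force baseline concrete, the following is a minimal sketch, not the authors' implementation. It assumes EMIM is computed in its common form, as the mutual information between the binary presence/absence of two terms across documents; the abstract does not give the paper's exact formulation. The function names `emim` and `brute_force_similarity_lists` and the parameter `k` are hypothetical, and the Quadtree Heuristic itself is not shown because the abstract does not describe it.

```python
import math
from collections import defaultdict
from itertools import combinations


def emim(n, n_i, n_j, n_ij):
    """Mutual information between the presence/absence of two terms.

    n    : total number of documents
    n_i  : documents containing term i
    n_j  : documents containing term j
    n_ij : documents containing both terms
    Assumption: this is the standard binary-event form of EMIM.
    """
    # (joint count, marginal count for i, marginal count for j) per cell
    cells = [
        (n_ij, n_i, n_j),                           # both present
        (n_i - n_ij, n_i, n - n_j),                 # i present, j absent
        (n_j - n_ij, n - n_i, n_j),                 # i absent, j present
        (n - n_i - n_j + n_ij, n - n_i, n - n_j),   # both absent
    ]
    total = 0.0
    for joint, marg_i, marg_j in cells:
        if joint > 0:  # empty cells contribute nothing to the sum
            total += (joint / n) * math.log(joint * n / (marg_i * marg_j))
    return total


def brute_force_similarity_lists(docs, k=3):
    """Score every term pair and keep the k highest-scoring terms per term."""
    n = len(docs)
    postings = defaultdict(set)  # term -> ids of documents containing it
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings[term].add(doc_id)

    scores = defaultdict(list)
    # Quadratic in the vocabulary size: this is the cost a heuristic would avoid.
    for a, b in combinations(sorted(postings), 2):
        s = emim(n, len(postings[a]), len(postings[b]),
                 len(postings[a] & postings[b]))
        scores[a].append((s, b))
        scores[b].append((s, a))

    # Sort by descending score, breaking ties alphabetically.
    return {
        term: [other for _, other in
               sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for term, pairs in scores.items()
    }
```

Calling `brute_force_similarity_lists(docs, k=5)` on a list of tokenized documents returns a dictionary mapping each term to its five highest-scoring terms. Because every pair of terms is scored, the running time grows quadratically with vocabulary size, which is the motivation for a faster heuristic on large collections. Mutual information is also high for terms that systematically avoid each other, so a practical system might add a filter for positive association; that is a design choice outside what the abstract states.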