A Method for Calculating Term Similarity on Large Document Collections

Authors:
Wolfgang W. Bein;Jeffrey S. Coombs;Kazem Taghva
Venue:
ITCC '03 Proceedings of the International Conference on Information Technology: Computers and Communications
Year:
2003

Citing 0
Cited 1

Neurolinguistic approach to natural language processing with applications to medical text analysis

Neural Networks

Abstract

We present an efficient algorithm called the Quadtree Heuristic for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the Expected Mutual Information Measure (EMIM). Since our aim for defining the similarity lists is to improve information retrieval (IR), we present the outcome of an experiment comparing the performance of an IR engine designed to use the similarity lists. Two methods were used to generate similarity lists: a brute-force technique and the Quadtree Heuristic. The performance of the list generated by the Quadtree Heuristic was commensurate with the brute-force list.
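
Illustrative sketch (not from the paper): the abstract does not give the exact EMIM formulation or any detail of the Quadtree Heuristic, so the code below only illustrates the brute-force baseline under stated assumptions. It assumes the commonly used four-cell form of EMIM over term presence/absence in documents, EMIM(a, b) = sum over x, y in {0, 1} of P(x, y) log(P(x, y) / (P(x) P(y))), estimated from document counts. The function names `emim` and `brute_force_similarity_lists`, the parameter `k`, and the toy collection are hypothetical.

```python
import math
from itertools import combinations


def emim(n_docs, n_a, n_b, n_ab):
    """EMIM of terms a and b from document counts (assumed four-cell form).

    n_docs: documents in the collection; n_a, n_b: documents containing
    each term; n_ab: documents containing both.
    """
    # (joint count, marginal count for a's state, marginal count for b's state)
    cells = [
        (n_ab, n_a, n_b),                                         # both present
        (n_a - n_ab, n_a, n_docs - n_b),                          # a only
        (n_b - n_ab, n_docs - n_a, n_b),                          # b only
        (n_docs - n_a - n_b + n_ab, n_docs - n_a, n_docs - n_b),  # neither
    ]
    total = 0.0
    for joint, marg_a, marg_b in cells:
        if joint > 0:  # an empty cell contributes nothing (0 * log 0 := 0)
            total += (joint / n_docs) * math.log(joint * n_docs / (marg_a * marg_b))
    return total


def brute_force_similarity_lists(docs, k=5):
    """Score every pair of distinct terms and keep the k best per term."""
    # Map each term to the set of documents in which it occurs.
    postings = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings.setdefault(term, set()).add(doc_id)

    scores = {term: [] for term in postings}
    # Quadratic in the vocabulary size: every unordered pair is scored once.
    for a, b in combinations(sorted(postings), 2):
        s = emim(len(docs), len(postings[a]), len(postings[b]),
                 len(postings[a] & postings[b]))
        scores[a].append((s, b))
        scores[b].append((s, a))

    # Highest score first; ties are broken alphabetically for determinism.
    return {
        term: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for term, pairs in scores.items()
    }


# Toy collection (hypothetical): "apple" and "pear" always co-occur.
toy_docs = [
    ["apple", "pear"],
    ["apple", "pear"],
    ["apple", "pear", "kiwi"],
    ["kiwi"],
    [],
    [],
]
toy_lists = brute_force_similarity_lists(toy_docs, k=1)
```

The pairwise loop makes the cost grow quadratically with the number of unique terms, which is the kind of expense a heuristic would aim to avoid on a large collection. Note also that this form of EMIM rewards any statistical dependence, so two terms that never co-occur can score as high as two that always do.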