Empirical Formula for Testing Word Similarity and Its Application for Constructing a Word Frequency List

Authors:
Pavel Makagonov; Mikhail Alexandrov
Venue:
CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2002

Citing 1
Cited 1

Foundations of statistical natural language processing

Constructing Empirical Formulas for Testing Word Similarity by the Inductive Method of Model Self-Organization

PorTAL '02 Proceedings of the Third International Conference on Advances in Natural Language Processing

Quantified Score

Hi-index	0.00

Abstract

Many document categorization and clustering tasks require automatically learning a word frequency list from a corpus. However, morphological variation distorts the statistics when a program treats words as mere letter strings, so it is important to identify the strings that arise as morphological variants of the same base meaning. Because large morphological dictionaries have well-known technical disadvantages, we propose an approximate heuristic method for this identification, based on an empirical formula for testing the similarity of two words. The formula depends on the number of coincident letters in the initial parts of the two words and the number of non-coincident letters in their final parts. We give a simple method for determining the formula's parameters. An iterative algorithm then constructs the word frequency list from the common parts of all similar words. We give English and Spanish examples. The described technology is implemented in our system Dictionary Designer.
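
The abstract does not state the formula itself, so the following is only a minimal sketch of the general idea, not the authors' formula or the Dictionary Designer implementation. It assumes a simple test: two words count as similar when they share at least `min_prefix` initial letters and the share of non-coincident final letters does not exceed `max_tail_ratio`. Both parameter names and their default values are hypothetical, as are the function names.

```python
from collections import Counter


def common_prefix_len(a: str, b: str) -> int:
    """Number of coincident letters at the start of the two words."""
    y = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        y += 1
    return y


def similar(a: str, b: str, max_tail_ratio: float = 0.4, min_prefix: int = 3) -> bool:
    """Hypothetical similarity test (thresholds are assumptions, not from the paper)."""
    y = common_prefix_len(a, b)
    if y < min_prefix:
        return False
    n = (len(a) - y) + (len(b) - y)  # non-coincident letters in the final parts
    s = len(a) + len(b)              # total letters in both words
    return n / s <= max_tail_ratio


def build_frequency_list(tokens, **params):
    """Merge alphabetically adjacent similar words until nothing changes.

    Each merged entry is keyed by the common initial part of its members.
    """
    entries = sorted(Counter(t.lower() for t in tokens).items())
    changed = True
    while changed:
        changed = False
        merged = []
        for word, freq in entries:
            if merged and similar(merged[-1][0], word, **params):
                prev_word, prev_freq = merged[-1]
                stem = prev_word[:common_prefix_len(prev_word, word)]
                merged[-1] = (stem, prev_freq + freq)
                changed = True
            else:
                merged.append((word, freq))
        # Re-aggregate in case two merged entries ended up with the same key.
        totals = Counter()
        for word, freq in merged:
            totals[word] += freq
        entries = sorted(totals.items())
    return dict(entries)


english = ["connect", "connected", "connecting", "connection", "table", "tables", "cat"]
spanish = ["trabajo", "trabajos", "trabajar"]
print(build_frequency_list(english))
print(build_frequency_list(spanish))
```

With these assumed thresholds, the English tokens collapse to the keys `connect`, `table`, and `cat`, and the Spanish tokens collapse to `trabaj`. Comparing only alphabetically adjacent entries keeps the sketch short; it is a design shortcut of this example, not something the abstract prescribes.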