Choosing a distance metric for automatic word categorization

Authors:
Emin Erkan Korkmaz;Göktürk Üçoluk
Affiliations:
Middle East Technical University, Ankara-Turkey;Middle East Technical University, Ankara-Turkey
Venue:
NeMLaP3/CoNLL '98 Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning
Year:
1998

Abstract

This paper analyzes the behavior of different distance metrics that can be used in a bottom-up, unsupervised algorithm for automatic word categorization. The proposed method uses a modified greedy algorithm. Formulations from fuzzy theory are also used to calculate the degree of membership of the elements in the linguistic clusters that are formed. The unigram and bigram statistics of a corpus of about two million words are used. Empirical comparisons are made to support the discussion of which type of distance metric is most suitable for measuring the similarity between linguistic elements.
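
To make the ingredients named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each word is represented by the normalized counts of the words that follow it (one possible use of bigram statistics), compares two candidate metrics (Manhattan and Euclidean, chosen here only for illustration), and computes memberships with the standard fuzzy c-means formula. All function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
import math


def bigram_context_vectors(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    vectors = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        vectors[left][right] += 1
    return vectors


def normalize(counter):
    """Turn raw follower counts into relative frequencies."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()} if total else {}


def manhattan(p, q):
    """L1 distance between two sparse frequency vectors."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def euclidean(p, q):
    """L2 distance between two sparse frequency vectors."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def fuzzy_memberships(distances, m=2.0):
    """Fuzzy c-means style memberships from distances to each cluster.

    Assumption: the standard formula u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)),
    with fuzzifier m > 1. The paper may use a different formulation.
    """
    zeros = [i for i, d in enumerate(distances) if d == 0]
    if zeros:
        # A point that coincides with one or more centres belongs fully to them.
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(distances))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
            for d_i in distances]


# Toy corpus (hypothetical); the paper uses about two million words.
tokens = "the cat sat the dog sat the cat ran".split()
vectors = bigram_context_vectors(tokens)
cat, dog = normalize(vectors["cat"]), normalize(vectors["dog"])

print(manhattan(cat, dog))
print(euclidean(cat, dog))
print(fuzzy_memberships([1.0, 3.0]))
```

The two metrics can rank word pairs differently on the same statistics, which is the kind of difference the empirical comparison in the paper is concerned with.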