Clustering words with the MDL principle

  • Authors:
  • Hang Li;Naoki Abe

  • Affiliations:
  • Theory NEC Laboratory, RWCP, NEC, Kawasaki, Japan;Theory NEC Laboratory, RWCP, NEC, Kawasaki, Japan

  • Venue:
  • COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
  • Year:
  • 1996

Quantified Score

Hi-index 0.00

Visualization

Abstract

We address the problem of automatically constructing a thesaurus by clustering words based on corpus data. We view this problem as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose a learning algorithm based on the Minimum Description Length (MDL) Principle for such estimation. We empirically compared the performance of our method based on the MDL Principle against the Maximum Likelihood Estimator in word clustering, and found that the former outperforms the latter. We also evaluated the method by conducting pp-attachment disambiguation experiments using an automatically constructed thesaurus. Our experimental results indicate that such a thesaurus can be used to improve accuracy in disambiguation.