Word clustering and disambiguation based on co-occurrence data

  • Authors:
  • Hang Li

  • Affiliations:
  • Theory NEC Laboratory, Real World Computing Partnership, c//o Internet Systems Research Laboratories, NEC Corporation, 4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216-8555, Japan/ e-mail: hangli@mi ...

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and conducting syntactic disambiguation by using the acquired word classes. We view the clustering problem as that of estimating a class-based probability distribution specifying the joint probabilities of word pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability model. Our clustering method is a natural extension of that proposed in Brown, Della Pietra, deSouza, Lai and Mercer (1992). We next propose a syntactic disambiguation method which combines the use of automatically constructed word classes and that of a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 88.2%, which compares favorably against the accuracies obtained by the state-of-the-art disambiguation methods.