Hierarchical clustering of words

  • Authors: Akira Ushioda
  • Affiliations: ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan
  • Venue: COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
  • Year: 1996

Abstract

This paper describes a data-driven method for hierarchical clustering of words in which a large vocabulary of English words is clustered bottom-up, with respect to corpora ranging in size from 5 to 50 million words, using a greedy algorithm that tries to minimize average loss of mutual information of adjacent classes. The resulting hierarchical clusters of words are then naturally transformed into a bit-string representation of (i.e. word bits for) all the words in the vocabulary. Introducing word bits into the ATR Decision-Tree POS Tagger is shown to significantly reduce the tagging error rate. Portability of word bits from one domain to another is also discussed.
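
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the algorithm's details, so the function names (`average_mutual_information`, `cluster_words`, `word_bits`), the toy corpus, and the choice of "0" for the left branch and "1" for the right branch are all assumptions made for illustration. The sketch starts with one class per word, repeatedly merges the pair of classes whose merge leaves the highest average mutual information between adjacent classes (that is, the smallest loss), and then reads each word's bit string off its path in the resulting binary tree. It recomputes the score from scratch for every candidate pair, which is only feasible for a tiny vocabulary.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, assign):
    """Average mutual information of adjacent classes under `assign` (word -> class id)."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[c1, c2] += n
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )


def cluster_words(tokens):
    """Greedy bottom-up clustering; returns a binary tree of nested 2-tuples."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = sorted(set(tokens))
    assign = {w: i for i, w in enumerate(vocab)}  # start with one class per word
    tree = {i: w for i, w in enumerate(vocab)}    # class id -> subtree built so far
    next_id = len(vocab)
    while len(tree) > 1:
        best = None
        for a, b in combinations(sorted(tree), 2):
            # Tentatively fold class b into class a and score the result.
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            score = average_mutual_information(bigrams, trial)
            # Highest remaining information means smallest loss.
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best
        tree[next_id] = (tree.pop(a), tree.pop(b))
        assign = {w: (next_id if c in (a, b) else c) for w, c in assign.items()}
        next_id += 1
    (root,) = tree.values()
    return root


def word_bits(tree, prefix=""):
    """Map each word to the bit string of its root-to-leaf path (assumed: 0 = left, 1 = right)."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}
    left, right = tree
    bits = word_bits(left, prefix + "0")
    bits.update(word_bits(right, prefix + "1"))
    return bits


if __name__ == "__main__":
    toy = "the cat sat the dog sat a cat ran a dog ran".split()
    for word, code in sorted(word_bits(cluster_words(toy)).items()):
        print(word, code)
```

Words that are merged early share long bit-string prefixes, so truncating the strings to a fixed number of leading bits gives coarser word classes. That prefix structure is what makes such a representation convenient as a feature for a decision-tree tagger.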