Graph-based word clustering using a web search engine

  • Authors:
  • Yutaka Matsuo; Takeshi Sakaki; Kôki Uchiyama; Mitsuru Ishizuka

  • Affiliations:
  • National Institute of Advanced Industrial Science and Technology, Sotokanda, Tokyo; University of Tokyo, Hongo, Tokyo; Hottolink Inc., Nishi-gotanda, Tokyo; University of Tokyo, Hongo, Tokyo

  • Venue:
  • EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2006

Abstract

Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure computed from web counts. Each pair of words is submitted as a query to a search engine, and the resulting counts produce a co-occurrence matrix. By calculating the similarity of words, a word co-occurrence graph is obtained. A new kind of graph clustering algorithm called Newman clustering is applied to identify word clusters efficiently. Evaluations are made on two sets of word groups derived from a web directory and WordNet.
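
The pipeline outlined in the abstract (pairwise web counts, a similarity score, a word graph, then modularity-based clustering) can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than details from the abstract: pointwise mutual information stands in for the similarity measure, an edge is kept when the score exceeds a hypothetical `threshold`, the hit counts are invented toy numbers instead of real search-engine results, and the clustering step is a plain, unoptimized greedy modularity merge in the spirit of Newman's method. All function names are illustrative.

```python
import math
from itertools import combinations

def pmi_scores(single_counts, pair_counts, total_pages):
    """Pointwise mutual information for each word pair with a nonzero count."""
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab > 0:
            p_ab = n_ab / total_pages
            p_a = single_counts[a] / total_pages
            p_b = single_counts[b] / total_pages
            scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

def build_graph(words, scores, threshold):
    """Undirected, unweighted graph: keep an edge when the score exceeds threshold."""
    adj = {w: set() for w in words}
    for (a, b), s in scores.items():
        if s > threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def modularity(adj, communities):
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for c in communities:
        internal = sum(1 for u in c for v in adj[u] if v in c) / 2
        degree = sum(len(adj[u]) for u in c)
        q += internal / m - (degree / (2 * m)) ** 2
    return q

def greedy_modularity_clusters(adj):
    """Start from singletons; repeatedly merge the connected pair of
    communities that raises modularity most; stop when no merge helps."""
    communities = [frozenset([w]) for w in adj]
    best_q = modularity(adj, communities)
    while len(communities) > 1:
        best = None
        for c1, c2 in combinations(communities, 2):
            if not any(v in c2 for u in c1 for v in adj[u]):
                continue  # only communities joined by an edge are candidates
            merged = [c for c in communities if c not in (c1, c2)] + [c1 | c2]
            q = modularity(adj, merged)
            if best is None or q > best[0]:
                best = (q, merged)
        if best is None or best[0] <= best_q:
            break
        best_q, communities = best
    return communities

# Toy data (invented for illustration, not real web counts).
fruit = ["apple", "banana", "cherry"]
office = ["printer", "scanner", "toner"]
words = fruit + office
total_pages = 1_000_000
single_counts = {w: 1000 for w in words}
pair_counts = {pair: 1 for pair in combinations(words, 2)}
pair_counts.update({pair: 100 for pair in combinations(fruit, 2)})
pair_counts.update({pair: 100 for pair in combinations(office, 2)})
pair_counts[("apple", "printer")] = 50  # a single bridge between the groups

scores = pmi_scores(single_counts, pair_counts, total_pages)
graph = build_graph(words, scores, threshold=1.0)
clusters = greedy_modularity_clusters(graph)
```

On this toy input the graph consists of two triangles joined by one bridge edge, and the greedy merge keeps the two groups apart because joining them would lower modularity. The sketch recomputes modularity from scratch for every candidate merge, which is simple but slow; an efficient implementation would update the modularity change incrementally.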