Use of mutual information based character clusters in dictionary-less morphological analysis of Japanese

  • Authors:
  • Hideki Kashioka; Yasuhiro Kawata; Yumiko Kinjo; Andrew Finch; Ezra W. Black

  • Affiliations:
  • ATR Interpreting Telecommunications Research Laboratories (all authors)

  • Venue:
  • COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
  • Year:
  • 1998

Abstract

For languages such as Japanese, whose character set is very large and whose orthography does not require spacing between words, tokenizing and part-of-speech tagging are often the difficult parts of any morphological analysis. Practical systems primarily tackle this problem with uncontrolled heuristics. Information on character sorts, however, mitigates this difficulty. This paper presents our method of incorporating character clustering based on mutual information into decision-tree, dictionary-less morphological analysis. We have confirmed that using natural classes significantly improves our morphological analyzer in both tokenizing and tagging Japanese text.
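
The abstract does not describe the clustering procedure itself, so the following is only a minimal sketch of one generic way to cluster characters by mutual information: a greedy agglomerative merge that, at each step, joins the two classes whose merge keeps the average mutual information between adjacent character classes highest. The function names `average_mutual_information` and `cluster_characters`, the use of adjacent-character bigrams, and the merge strategy are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, cluster_of):
    """Average mutual information (bits) between classes of adjacent characters."""
    pair_counts = Counter((cluster_of[a], cluster_of[b]) for a, b in bigrams)
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair_counts.items():
        left[c1] += n
        right[c2] += n
    ami = 0.0
    for (c1, c2), n in pair_counts.items():
        p12 = n / total
        # p12 / (p1 * p2), written with raw counts
        ami += p12 * math.log2(p12 * total * total / (left[c1] * right[c2]))
    return ami


def cluster_characters(text, num_clusters):
    """Greedily merge character classes until num_clusters remain.

    Hypothetical illustration: each class is named after one of its members.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    bigrams = list(zip(text, text[1:]))
    cluster_of = {ch: ch for ch in set(text)}  # start with one class per character
    clusters = set(cluster_of.values())
    while len(clusters) > num_clusters:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Tentatively fold class b into class a and score the result.
            trial = {ch: (a if c == b else c) for ch, c in cluster_of.items()}
            score = average_mutual_information(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, a, b)
        _, a, b = best
        cluster_of = {ch: (a if c == b else c) for ch, c in cluster_of.items()}
        clusters.discard(b)
    return cluster_of


if __name__ == "__main__":
    sample = "かきくけこカキクケコかきくけこカキクケコ"
    for ch, cls in sorted(cluster_characters(sample, 2).items()):
        print(ch, "->", cls)
```

This sketch rescans every candidate pair at every step, which is only practical for toy inputs. In a dictionary-less analyzer, the resulting class labels could serve as features of each character in place of the raw character, which is the general role the abstract assigns to character clusters.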