Use of mutual information based character clusters in dictionary-less morphological analysis of Japanese

Authors:
Hideki Kashioka; Yasuhiro Kawata; Yumiko Kinjo; Andrew Finch; Ezra W. Black
Affiliations:
ATR Interpreting Telecommunications Research Laboratories (all authors)
Venue:
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Year:
1998

Abstract

For languages such as Japanese, whose character set is very large and whose orthography does not require spacing between words, tokenizing and part-of-speech tagging are often the difficult parts of any morphological analysis. Practical systems tackle this problem primarily with uncontrolled heuristics. The use of information on character sorts, however, mitigates this difficulty. This paper presents our method of incorporating character clustering based on mutual information into decision-tree, dictionary-less morphological analysis. Using natural classes, we confirmed that our morphological analyzer improves significantly in both tokenizing and tagging Japanese text.
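The abstract does not spell out the clustering procedure. As a rough illustration only, the sketch below shows one common way to derive character classes from mutual information: start with one class per character, then repeatedly merge the pair of classes whose merger preserves the most average mutual information between the classes of adjacent characters. The function names, the bigram-based objective, and the greedy merge loop are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(bigrams, assignment):
    """Average mutual information between the classes of adjacent characters.

    bigrams: Counter mapping (left_char, right_char) -> count
    assignment: dict mapping each character to a class id
    """
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (a, b), n in bigrams.items():
        ca, cb = assignment[a], assignment[b]
        joint[(ca, cb)] += n
        left[ca] += n
        right[cb] += n
    ami = 0.0
    for (ca, cb), n in joint.items():
        # p(ca, cb) * log2( p(ca, cb) / (p(ca) * p(cb)) )
        ami += (n / total) * math.log2(n * total / (left[ca] * right[cb]))
    return ami


def cluster_characters(text, num_classes):
    """Greedy bottom-up clustering (illustrative assumption, not the paper's code).

    Starts with one class per character and merges classes until only
    num_classes remain, each time keeping the merge with the highest
    remaining average mutual information.
    """
    bigrams = Counter(zip(text, text[1:]))
    assignment = {ch: ch for ch in set(text)}
    classes = set(assignment.values())
    while len(classes) > num_classes:
        best = None
        for c1, c2 in combinations(sorted(classes), 2):
            # Tentatively fold class c2 into class c1.
            trial = {ch: (c1 if c == c2 else c) for ch, c in assignment.items()}
            score = average_mutual_information(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        assignment = best[1]
        classes = set(assignment.values())
    return assignment
```

Calling `cluster_characters(corpus_text, k)` returns a mapping from each character to a class label. In a setup like the one the abstract describes, such labels could presumably serve as features for the decision-tree analyzer, though how the paper actually uses them is not stated here. The exhaustive pair search is quadratic in the number of classes per merge, so this version only suits small character inventories.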