Japanese word segmentation by hidden Markov model

Authors:
Constantine P. Papageorgiou
Affiliations:
BBN Systems and Technologies, Cambridge, MA
Venue:
HLT '94 Proceedings of the workshop on Human Language Technology
Year:
1994

References (7):
Character code for Japanese text processing. Journal of Information Processing
Japanese word processing. IEEE Spectrum
Studies in part of speech labelling. HLT '91 Proceedings of the workshop on Speech and Natural Language
A probabilistic algorithm for segmenting non-Kanji Japanese strings. AAAI '94 Proceedings of the twelfth national conference on Artificial intelligence (vol. 1)
A stochastic parts program and noun phrase parser for unrestricted text. ANLC '88 Proceedings of the second conference on Applied natural language processing
LINGSTAT: an interactive, machine-aided translation system. HLT '93 Proceedings of the workshop on Human Language Technology
Example-based correction of word segmentation and part of speech labelling. HLT '93 Proceedings of the workshop on Human Language Technology

Cited by (3):
Mostly-unsupervised statistical segmentation of Japanese Kanji sequences. Natural Language Engineering
Mostly-unsupervised statistical segmentation of Japanese: applications to kanji. NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
An Information-Extraction System for Urdu---A Resource-Poor Language. ACM Transactions on Asian Language Information Processing (TALIP)

Abstract

The processing of Japanese text is complicated by the fact that there are no word delimiters. To segment Japanese text, systems typically use knowledge-based methods and large lexicons. This paper presents a novel approach to Japanese word segmentation which avoids the need for Japanese word lexicons and explicit rule bases. The algorithm utilizes a hidden Markov model, a stochastic process, to determine word boundaries. This method has achieved 91% accuracy in segmenting words in a test corpus.
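
The abstract does not say which states, observations, or training procedure the model uses, so the following is only an illustrative sketch of how a hidden Markov model can place word boundaries without a word lexicon; it is not the paper's model. It assumes a two-state tagging scheme (B: the character begins a word, I: the character continues a word), probabilities estimated by counting over a tiny hand-segmented corpus with add-one smoothing, and Viterbi decoding. The names `tag`, `train`, `log_prob`, and `segment`, and the toy corpus, are all hypothetical.

```python
import math
from collections import defaultdict

STATES = ("B", "I")  # B: character begins a word, I: character continues a word


def tag(words):
    """Turn a segmented sentence (list of words) into (character, state) pairs."""
    return [(ch, "B" if i == 0 else "I") for w in words for i, ch in enumerate(w)]


def train(corpus):
    """Collect transition and emission counts from a list of segmented sentences."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for words in corpus:
        prev = "START"
        for ch, state in tag(words):
            trans[prev][state] += 1
            emit[state][ch] += 1
            vocab.add(ch)
            prev = state
    return trans, emit, vocab


def log_prob(table, given, event, n_events):
    """Add-one smoothed log P(event | given)."""
    total = sum(table[given].values())
    return math.log((table[given][event] + 1) / (total + n_events))


def segment(text, model):
    """Viterbi decoding: most likely B/I sequence, returned as a list of words."""
    trans, emit, vocab = model
    if not text:
        return []
    n_emit = len(vocab) + 1  # one extra slot so unseen characters get some mass

    def e(state, ch):
        return log_prob(emit, state, ch, n_emit)

    def t(prev, state):
        return log_prob(trans, prev, state, len(STATES))

    # The first character always begins a word, so state I is ruled out there.
    score = {"B": t("START", "B") + e("B", text[0]), "I": float("-inf")}
    back = []
    for ch in text[1:]:
        new_score, ptr = {}, {}
        for state in STATES:
            best = max(STATES, key=lambda p: score[p] + t(p, state))
            new_score[state] = score[best] + t(best, state) + e(state, ch)
            ptr[state] = best
        score = new_score
        back.append(ptr)

    # Trace the best path backwards, then cut the text at every B.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()

    words = []
    for ch, s in zip(text, path):
        if s == "B":
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Toy training data (hypothetical); a real system would need a large corpus.
corpus = [
    ["私", "は", "学生", "です"],
    ["彼", "は", "先生", "です"],
    ["学生", "は", "本", "を", "読む"],
]
model = train(corpus)
print(segment("私は先生です", model))
```

Whatever boundaries the decoder chooses, joining the returned words reproduces the input string, because the sketch only decides where to cut. With so little training data the segmentation itself should not be expected to be reliable.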