Mostly-unsupervised statistical segmentation of Japanese Kanji sequences

Authors:
Rie Kubota Ando; Lillian Lee
Affiliations:
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA (e-mail: rie1@us.ibm.com); Department of Computer Science, Cornell University, Ithaca, NY 14853-7501, USA (e-mail: llee@cs.cornell.edu)
Venue:
Natural Language Engineering
Year:
2003

Abstract

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously.
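
The abstract does not spell out how the statistical method works, so the following is only a minimal sketch of one way character n-gram counts gathered from unsegmented text could be used to propose word boundaries. It is not presented as the authors' implementation. The function names (ngram_counts, boundary_votes, segment), the choice of n-gram orders, the 0.5 threshold, and the toy ASCII corpus used in the test are all assumptions made for illustration. The idea sketched is that a cut between two characters is plausible when the n-grams lying entirely on either side of the cut are more frequent than the n-grams that straddle it.

```python
from collections import Counter


def ngram_counts(corpus, orders):
    """Count every character n-gram of the given orders in unsegmented lines."""
    counts = Counter()
    for line in corpus:
        for n in orders:
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def boundary_votes(seq, counts, orders):
    """Score each internal position k (a cut between seq[k-1] and seq[k]).

    For each order n, the score is the fraction of (side, straddler) pairs in
    which the side n-gram is strictly more frequent than the straddling one.
    Scores are averaged over the orders that apply at position k.
    """
    votes = {}
    for k in range(1, len(seq)):
        per_order = []
        for n in orders:
            # n-grams that end exactly at k or start exactly at k
            sides = []
            if k - n >= 0:
                sides.append(seq[k - n:k])
            if k + n <= len(seq):
                sides.append(seq[k:k + n])
            # n-grams that contain characters on both sides of k
            first = max(0, k - n + 1)
            last = min(k, len(seq) - n + 1)
            straddlers = [seq[j:j + n] for j in range(first, last)]
            if not sides or not straddlers:
                continue  # this order gives no evidence at position k
            wins = sum(counts[s] > counts[t] for s in sides for t in straddlers)
            per_order.append(wins / (len(sides) * len(straddlers)))
        votes[k] = sum(per_order) / len(per_order) if per_order else 0.0
    return votes


def segment(seq, counts, orders=(2,), threshold=0.5):
    """Cut seq wherever the averaged vote reaches the (assumed) threshold."""
    votes = boundary_votes(seq, counts, orders)
    pieces, start = [], 0
    for k in range(1, len(seq)):
        if votes[k] >= threshold:
            pieces.append(seq[start:k])
            start = k
    pieces.append(seq[start:])
    return pieces
```

In this sketch the only training input is raw, unsegmented text; a small amount of supervision would enter only through the choice of n-gram orders and threshold, which here are fixed by hand. A thresholded vote is a deliberate simplification; other decision rules, such as cutting at local maxima of the vote, fit the same structure.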