Morphological analysis of a large spontaneous speech corpus in Japanese

Authors:
Kiyotaka Uchimoto;Chikashi Nobata;Atsushi Yamada;Satoshi Sekine;Hitoshi Isahara
Affiliations:
Communications Research Laboratory, Kyoto, Japan;Communications Research Laboratory, Kyoto, Japan;Communications Research Laboratory, Kyoto, Japan;New York University, New York, NY;Communications Research Laboratory, Kyoto, Japan
Venue:
ACL '03 Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1
Year:
2003

References (4)

A maximum entropy approach to natural language processing
Computational Linguistics

Word extraction from corpora and its part-of-speech estimation using distributional analysis
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2

A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics

Morphological analysis of the spontaneous speech corpus
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 2

Cited by (2)

Japanese unknown word identification by character-based chunking
COLING '04 Proceedings of the 20th international conference on Computational Linguistics

Morphological annotation of a large spontaneous speech corpus in Japanese
IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence

Abstract

This paper describes two methods for detecting word segments and their morphological information in a Japanese spontaneous speech corpus, and shows how the two methods can be used together to tag a large spontaneous speech corpus accurately. The first method can detect any type of word segment. The second method is used when there are several definitions of word segments and their POS categories, and when one type of word segment includes another. We show that semi-automatic analysis achieves a precision of better than 99% for detecting and tagging short words and 97% for long words, the two types of words that make up the corpus. We also show that using both methods yields better accuracy than using only the first.