Detecting sentence boundaries in Japanese speech transcriptions using a morphological analyzer

Authors:
Sachie Tajima;Hidetsugu Nanba;Manabu Okumura
Affiliations:
Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, Japan;Graduate School of Information Sciences, Hiroshima City University, Hiroshima, Japan;Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama, Japan
Venue:
IJCNLP'04 Proceedings of the First International Joint Conference on Natural Language Processing
Year:
2004

Abstract

We present a method for automatically detecting sentence boundaries (SBs) in Japanese speech transcriptions. The method uses a cost-based Japanese morphological analyzer, which selects the analysis with the minimum cost as its best result. The idea behind using a morphological analyzer to identify candidate SBs is that it assigns lower costs to better sequences of morphemes. After the candidate SBs have been identified, unsuitable candidates are deleted using lexical information acquired from a training corpus. On a corpus of lecture speech transcriptions in which the SBs are not given, the method achieved 77.24% precision, 88.00% recall, and an F-measure of 0.8277.
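
The two-stage idea in the abstract (propose candidate SBs wherever a split lowers the analysis cost, then prune candidates with lexical information) can be illustrated with a small Python sketch. This is not the authors' implementation. The connection costs, the part-of-speech tags, the romanized example words, and the `NON_FINAL_WORDS` list are all invented for illustration, and a real system would obtain its costs from an actual morphological analyzer.

```python
# Toy illustration only: every cost, tag, and word list below is invented.
BOS, EOS = "BOS", "EOS"

# Hypothetical connection costs between adjacent part-of-speech tags.
CONNECTION_COST = {
    (BOS, "noun"): 1,
    ("noun", "particle"): 1,
    ("particle", "verb"): 1,
    ("verb", "aux"): 1,
    ("noun", "aux"): 1,
    ("aux", EOS): 1,
    ("verb", EOS): 2,
    ("verb", "noun"): 5,
    ("aux", "noun"): 8,
}
DEFAULT_COST = 9  # assumed cost for any tag pair not listed above

# Hypothetical lexical filter: words assumed (from imaginary training data)
# to almost never end a sentence.
NON_FINAL_WORDS = {"yomu"}


def sequence_cost(tags):
    """Sum the connection costs along BOS, tags..., EOS."""
    path = [BOS, *tags, EOS]
    return sum(CONNECTION_COST.get(pair, DEFAULT_COST)
               for pair in zip(path, path[1:]))


def candidate_boundaries(tags):
    """Return positions i where splitting before tags[i] lowers the total cost."""
    whole = sequence_cost(tags)
    return [i for i in range(1, len(tags))
            if sequence_cost(tags[:i]) + sequence_cost(tags[i:]) < whole]


def filter_candidates(words, candidates):
    """Drop candidates whose preceding word is in the non-final word list."""
    return [i for i in candidates if words[i - 1] not in NON_FINAL_WORDS]


# Unpunctuated toy input: "hon o yomu hito desu tsugi ni susumi masu"
tokens = [("hon", "noun"), ("o", "particle"), ("yomu", "verb"),
          ("hito", "noun"), ("desu", "aux"), ("tsugi", "noun"),
          ("ni", "particle"), ("susumi", "verb"), ("masu", "aux")]
words = [w for w, _ in tokens]
tags = [t for _, t in tokens]

candidates = candidate_boundaries(tags)
kept = filter_candidates(words, candidates)
print(candidates)  # [3, 5]
print(kept)        # [5]
```

In this toy input the cost comparison proposes a boundary after "yomu" (a verb modifying the following noun) and after "desu". The lexical filter removes the first, leaving only the boundary between "desu" and "tsugi".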