Utterance segmentation using combined approach based on Bi-directional N-gram and maximum entropy

Authors:
Ding Liu;Chengqing Zong
Affiliations:
Chinese Academy of Sciences, Beijing, China;Chinese Academy of Sciences, Beijing, China
Venue:
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Year:
2003

Citing 8
Cited 1

A maximum entropy approach to natural language processing

Computational Linguistics
Statistical Models for Text Segmentation

Machine Learning - Special issue on natural language learning
Nine Issues in Speech Translation

Machine Translation
Experiments on sentence boundary detection

ANLC '00 Proceedings of the sixth conference on Applied natural language processing
A maximum entropy approach to identifying sentence boundaries

ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Splitting long or ill-formed input for robust spoken-language translation

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Understanding unsegmented user utterances in real-time spoken dialogue systems

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Chinese utterance segmentation in spoken language translation

CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing

Detecting sentence boundaries in japanese speech transcriptions using a morphological analyzer

IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper proposes a new approach to segmentation of utterances into sentences using a new linguistic model based upon Maximum-entropy-weighted Bi-directional N-grams. The usual N-gram algorithm searches for sentence boundaries in a text from left to right only. Thus a candidate sentence boundary in the text is evaluated mainly with respect to its left context, without fully considering its right context. Using this approach, utterances are often divided into incomplete sentences or fragments. In order to make use of both the right and left contexts of candidate sentence boundaries, we propose a new linguistic modeling approach based on Maximum-entropy-weighted Bi-directional N-grams. Experimental results indicate that the new approach significantly outperforms the usual N-gram algorithm for segmenting both Chinese and English utterances.