Using prosody for automatic sentence segmentation of multi-party meetings

Authors:
Jáchym Kolář;Elizabeth Shriberg;Yang Liu
Affiliations:
International Computer Science Institute, Berkeley, CA;International Computer Science Institute, Berkeley, CA;International Computer Science Institute, Berkeley, CA
Venue:
TSD'06 Proceedings of the 9th international conference on Text, Speech and Dialogue
Year:
2006

Citing 5
Cited 3

Bagging predictors

Machine Learning
BoosTexter: A Boosting-based System for Text Categorization

Machine Learning - Special issue on information retrieval
Prosody-based automatic segmentation of speech into sentences and topics

Speech Communication - Special issue on accessing information in spoken audio
Fast and Robust Features for Prosodic Classification

TSD '99 Proceedings of the Second International Workshop on Text, Speech and Dialogue
Using conditional random fields for sentence boundary detection in speech

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics

A Comparison of Language Models for Dialog Act Segmentation of Meeting Transcripts

TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
Multi-view semi-supervised learning for dialog act segmentation of speech

IEEE Transactions on Audio, Speech, and Language Processing
The CALO meeting assistant system

IEEE Transactions on Audio, Speech, and Language Processing


Abstract

We explore the use of prosodic features beyond pauses, including duration, pitch, and energy features, for automatic sentence segmentation of ICSI meeting data. We examine two different approaches to boundary classification: score-level combination of independent language and prosodic models using HMMs, and feature-level combination of models using a boosting-based method (BoosTexter). We report classification results for reference word transcripts as well as for transcripts from a state-of-the-art automatic speech recognizer (ASR). We also compare results using the lexical model plus a pause-only prosody model, versus results using additional prosodic features. Results show that (1) information from pauses is important, including pause duration both at the boundary and at the previous and following word boundaries; (2) adding duration, pitch, and energy features yields significant improvement over pause alone; (3) the integrated boosting-based model performs better than the HMM for ASR conditions; (4) training the boosting-based model on recognized words yields further improvement.
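
The idea of score-level combination can be illustrated with a minimal Python sketch. It is not the authors' HMM system: it assumes, purely for illustration, that a lexical model and a prosodic model each already produce a boundary posterior for every word, and it merges them by log-linear interpolation before thresholding. The function names `combine_scores` and `segment`, the `weight` and `threshold` parameters, and the example probabilities are all hypothetical.

```python
import math


def combine_scores(p_lex, p_pros, weight=0.5):
    """Log-linearly interpolate two boundary posteriors (illustrative only).

    p_lex  : boundary probability from a lexical (language) model
    p_pros : boundary probability from a prosodic model
    weight : hypothetical interpolation weight given to the lexical model
    """
    eps = 1e-12  # guards against log(0)
    log_boundary = (weight * math.log(p_lex + eps)
                    + (1 - weight) * math.log(p_pros + eps))
    log_no_boundary = (weight * math.log(1 - p_lex + eps)
                       + (1 - weight) * math.log(1 - p_pros + eps))
    # Renormalise over the two classes (boundary / no boundary).
    top = max(log_boundary, log_no_boundary)
    b = math.exp(log_boundary - top)
    n = math.exp(log_no_boundary - top)
    return b / (b + n)


def segment(words, p_lex, p_pros, weight=0.5, threshold=0.5):
    """Split a word sequence after every word whose combined score
    reaches the threshold."""
    sentences, current = [], []
    for word, pl, pp in zip(words, p_lex, p_pros):
        current.append(word)
        if combine_scores(pl, pp, weight) >= threshold:
            sentences.append(current)
            current = []
    if current:  # keep any trailing words that had no closing boundary
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    words = ["ok", "so", "yes"]
    print(segment(words, [0.9, 0.1, 0.2], [0.8, 0.2, 0.1]))
```

A feature-level combination, by contrast, would pass the lexical and prosodic features together to a single classifier rather than merging two separately computed scores.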