Although speech recognition technology has improved significantly over the past few decades, current speech recognition systems output only a stream of words, without the structural information that would aid a human reader and downstream language processing modules. This thesis research focuses on the automatic detection of several helpful structural events in speech: sentence boundaries, utterance type, filled pauses, discourse markers, and edit disfluencies. The systems evaluated combine prosodic cues and textual information sources in a variety of ways to detect these events. Experiments were conducted across corpora (conversational speech and broadcast news) and across transcription conditions (human transcriptions versus recognition output). Because structural events are much less frequent than non-events, the imbalanced-data problem is investigated for training the decision-tree prosody model component of the system; a variety of sampling approaches and bagging are used to address the imbalance. Bagging yields significant performance improvements, and some of the sampling methods help as well, depending on the performance metric used. Sentence boundary detection and disfluency detection are affected differently by sampling, bagging, and boosting, reflecting inherent differences between the two tasks. A variety of methods for combining knowledge sources are examined: a hidden Markov model (HMM), a maximum entropy (Maxent) model, and a conditional random field (CRF). The Maxent and CRF models are discriminatively trained to model posterior probabilities, which correlate with the performance measures; they also tolerate highly correlated features, enabling the combination of diverse textual information sources. The HMM and CRF both model sequence information, whereas the Maxent model captures only local information.
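The bagging-with-sampling idea described above can be sketched in a few lines. The following is a minimal illustration only, not the thesis's actual prosody model: the features are synthetic, and the sampling scheme shown (each tree trained on all events plus an equal-size random sample of non-events, with posteriors averaged across trees) is one common way to combine undersampling with bagging on imbalanced data.

```python
# Sketch: bagging plus random undersampling for an imbalanced
# event / non-event task (illustrative; not the thesis's actual
# prosodic features or exact sampling configuration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic "prosodic" features: roughly 5% positive (event) class.
X = rng.normal(size=(2000, 4))
y = (rng.random(2000) < 0.05).astype(int)
X[y == 1] += 1.0  # make events weakly separable from non-events

def balanced_bag(X, y, n_trees=25):
    """Train each tree on all events plus an equal-size random
    subsample of non-events, so every tree sees balanced data."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    trees = []
    for _ in range(n_trees):
        sampled_neg = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sampled_neg])
        trees.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))
    return trees

def bagged_posterior(trees, X):
    # Bagged posterior: average the per-tree event probabilities.
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)

trees = balanced_bag(X, y)
p = bagged_posterior(trees, X)
```

Averaging posteriors across many balanced trees smooths the variance introduced by any single undersampled training set, which is one plausible reading of why bagging helps here.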
A model that combines all three approaches outperforms any single method. Interactions with other research efforts suggest that the methods developed in this thesis generalize well to other corpora (e.g., a multimodal corpus, a multiparty meeting corpus) and to related tasks (e.g., gesture modeling, dialog act segmentation and classification).
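One simple way to combine the posterior estimates of several models, such as the HMM, Maxent, and CRF discussed above, is linear interpolation at each candidate boundary. The sketch below is an assumed illustration of that general idea; the function name, uniform weights, and 0.5 decision threshold are hypothetical choices, not the thesis's actual combination scheme.

```python
# Sketch: linearly interpolating per-boundary event posteriors from
# three models (weights and threshold are illustrative assumptions).
def combine_posteriors(p_hmm, p_maxent, p_crf, weights=(1/3, 1/3, 1/3)):
    """Interpolate event posteriors from three models at each
    candidate boundary, then threshold the result at 0.5."""
    w1, w2, w3 = weights
    combined = [w1 * a + w2 * b + w3 * c
                for a, b, c in zip(p_hmm, p_maxent, p_crf)]
    decisions = [p >= 0.5 for p in combined]
    return combined, decisions

# Usage: two candidate boundaries, three models' posteriors each.
scores, events = combine_posteriors([0.9, 0.2], [0.7, 0.4], [0.8, 0.1])
```

In this toy example the first boundary (interpolated posterior 0.8) is labeled an event and the second is not; in practice the weights could be tuned on held-out data per metric.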