Although speech recognition technology has improved significantly over the past few decades, current speech recognition systems output only a stream of words, without the structural information that would aid a human reader and downstream language processing modules. This thesis research focuses on the automatic detection of several helpful structural events in speech: sentence boundaries, utterance type, filled pauses, discourse markers, and edit disfluencies. The systems evaluated combine prosodic cues and textual information sources in a variety of ways to detect these events. Experiments were conducted across corpora (conversational speech and broadcast news) and across transcription conditions (human transcriptions versus recognition output). Because structural events are much less frequent than non-events, the imbalanced-data problem is investigated for training the decision-tree prosody model component of the system; a variety of sampling approaches and bagging are used to address the imbalance. Bagging yields significant performance improvements, and some of the sampling methods help as well, depending on the performance metric used. Sentence boundary detection and disfluency detection are affected differently by sampling, bagging, and boosting, reflecting inherent differences between the two tasks. A variety of methods for combining knowledge sources are examined: a hidden Markov model (HMM), a maximum entropy (Maxent) model, and a conditional random field (CRF). The Maxent and CRF models are discriminatively trained to model posterior probabilities, which correlate with the performance measures; they also tolerate highly correlated features, enabling the combination of diverse textual information sources. The HMM and CRF both model sequence information, whereas the Maxent model captures only local information.
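The bagging-with-sampling idea described above can be sketched in a few lines. The following is a minimal illustration only, not the thesis's actual prosody model: the features are synthetic, and the sampling scheme shown (each tree trained on all events plus an equal-size random sample of non-events, with posteriors averaged across trees) is one common way to combine undersampling with bagging on imbalanced data.

```python
# Sketch: bagging plus random undersampling for an imbalanced
# event / non-event task (illustrative; not the thesis's actual
# prosodic features or exact sampling configuration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic "prosodic" features: roughly 5% positive (event) class.
X = rng.normal(size=(2000, 4))
y = (rng.random(2000) < 0.05).astype(int)
X[y == 1] += 1.0  # make events weakly separable from non-events

def balanced_bag(X, y, n_trees=25):
    """Train each tree on all events plus an equal-size random
    subsample of non-events, so every tree sees balanced data."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    trees = []
    for _ in range(n_trees):
        sampled_neg = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, sampled_neg])
        trees.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))
    return trees

def bagged_posterior(trees, X):
    # Bagged posterior: average the per-tree event probabilities.
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)

trees = balanced_bag(X, y)
p = bagged_posterior(trees, X)
```

Averaging posteriors across many balanced trees smooths the variance introduced by any single undersampled training set, which is one plausible reading of why bagging helps here.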
A model that combines all three approaches outperforms any single method. Interactions with other research efforts suggest that the methods developed in this thesis generalize well to other corpora (e.g., a multimodal corpus, a multiparty meeting corpus) and to related tasks (e.g., gesture modeling, dialog act segmentation and classification).
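One simple way to combine the posterior estimates of several models, such as the HMM, Maxent, and CRF discussed above, is linear interpolation at each candidate boundary. The sketch below is an assumed illustration of that general idea; the function name, uniform weights, and 0.5 decision threshold are hypothetical choices, not the thesis's actual combination scheme.

```python
# Sketch: linearly interpolating per-boundary event posteriors from
# three models (weights and threshold are illustrative assumptions).
def combine_posteriors(p_hmm, p_maxent, p_crf, weights=(1/3, 1/3, 1/3)):
    """Interpolate event posteriors from three models at each
    candidate boundary, then threshold the result at 0.5."""
    w1, w2, w3 = weights
    combined = [w1 * a + w2 * b + w3 * c
                for a, b, c in zip(p_hmm, p_maxent, p_crf)]
    decisions = [p >= 0.5 for p in combined]
    return combined, decisions

# Usage: two candidate boundaries, three models' posteriors each.
scores, events = combine_posteriors([0.9, 0.2], [0.7, 0.4], [0.8, 0.1])
```

In this toy example the first boundary (interpolated posterior 0.8) is labeled an event and the second is not; in practice the weights could be tuned on held-out data per metric.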