Modeling vocal interaction for segmentation in meeting recognition

  • Authors:
  • Kornel Laskowski; Tanja Schultz

  • Affiliations:
  • interACT, Carnegie Mellon University, Pittsburgh, PA (both authors)

  • Venue:
  • MLMI'07: Proceedings of the 4th International Conference on Machine Learning for Multimodal Interaction
  • Year:
  • 2007


Abstract

Automatic segmentation is an important technology for both automatic speech recognition and automatic speech understanding. In meetings, participants typically vocalize for only a fraction of the recorded time, yet standard vocal activity detection algorithms for close-talk microphones continue to treat participants independently. In this work we present a multispeaker segmentation system which models a particular aspect of human-human communication: vocal interaction, the interdependence between participants' on-off speech patterns. We describe our vocal interaction model, its training, and its use during vocal activity decoding. Our experiments show that this approach almost completely eliminates the problem of crosstalk, and word error rates on our development set are lower than those obtained with human-generated reference segmentation. We also observe significant performance improvements on unseen data.
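
To make the core idea concrete, the following is a minimal sketch (Python/NumPy, not from the paper) of joint vocal-activity decoding: the 2^K on-off configurations of K participants form a single discrete state space, transition probabilities between joint states are estimated from a reference segmentation, and Viterbi decoding recovers the joint state sequence from per-frame scores. All names, the toy data, and the scoring scheme here are illustrative assumptions; the paper's actual model, features, and training procedure are richer.

```python
# A minimal sketch of joint vocal-activity decoding for K participants,
# illustrating the idea of modeling the *joint* on-off speech state rather
# than treating channels independently. All names and the toy scoring are
# hypothetical; this is not the paper's actual implementation.

import numpy as np

K = 2           # number of participants (channels)
S = 2 ** K      # joint states: each state encodes who is speaking as a bit vector

def train_transitions(reference, smoothing=1.0):
    """Estimate joint-state transition probabilities from a reference
    segmentation given as an array of per-frame joint states in [0, S)."""
    counts = np.full((S, S), smoothing)
    for prev, curr in zip(reference[:-1], reference[1:]):
        counts[prev, curr] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def viterbi(frame_loglik, log_trans):
    """Decode the most likely joint-state sequence.
    frame_loglik: (T, S) per-frame log-likelihoods of each joint state
    (e.g. summed per-channel acoustic scores under each on/off hypothesis)."""
    T = frame_loglik.shape[0]
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = frame_loglik[0] - np.log(S)        # uniform initial-state prior
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (S, S): prev -> curr
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + frame_loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy example: learn transitions from a fake reference segmentation, then
# decode noisy per-frame scores. State 1 = only speaker A, state 2 = only B.
rng = np.random.default_rng(0)
reference = np.array([1] * 50 + [2] * 50)           # A talks, then B talks
log_trans = np.log(train_transitions(reference))
true_states = np.array([1] * 20 + [2] * 20)
frame_loglik = -2.0 * np.ones((40, S))
frame_loglik[np.arange(40), true_states] = 0.0      # true state scores best
frame_loglik += 0.5 * rng.standard_normal((40, S))  # add observation noise
decoded = viterbi(frame_loglik, log_trans)
print("frame accuracy:", (decoded == true_states).mean())
```

The point of coupling the channels into one state space is that the decoder can penalize improbable joint configurations, for example crosstalk showing up as apparently simultaneous speech on every close-talk channel, which independent per-channel detectors have no way to express.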