Further progress in meeting recognition: the ICSI-SRI spring 2005 speech-to-text evaluation system

  • Authors:
  • Andreas Stolcke, Xavier Anguera, Kofi Boakye, Özgür Çetin, František Grézl, Adam Janin, Arindam Mandal, Barbara Peskin, Chuck Wooters, Jing Zheng

  • Affiliations:
  • International Computer Science Institute, Berkeley, CA (Stolcke, Anguera, Boakye, Çetin, Grézl, Janin, Peskin, Wooters); University of Washington, Seattle, WA (Mandal); SRI International, Menlo Park, CA (Zheng)

  • Venue:
  • MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction
  • Year:
  • 2005

Abstract

We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year's system features improved delay-and-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved through several enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons (MLPs). In particular, both the acoustic models and the MLP features are adapted to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. Despite a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both the multiple distant microphone (MDM) and individual headset microphone (IHM) conditions compared to last year's evaluation system. Results on lecture data are comparable to the best reported results for that task.
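The delay-and-sum processing of distant microphone channels mentioned in the abstract can be sketched roughly as follows. This is an illustrative simplification, not the evaluation system's actual implementation: whole channels are aligned to a reference channel by cross-correlation and then averaged, whereas a real beamformer estimates delays over short windows and weights channels. The function name and parameters are hypothetical.

```python
import numpy as np

def delay_and_sum(channels, ref=0):
    """Align distant-microphone channels to a reference channel by
    cross-correlation, then average them (delay-and-sum beamforming).

    channels: list of equal-length 1-D numpy arrays (one per microphone).
    This global, whole-signal alignment is a simplification for illustration.
    """
    ref_sig = channels[ref]
    n = len(ref_sig)
    out = np.zeros(n)
    for sig in channels:
        # Estimate the relative lag that best aligns sig with the reference.
        corr = np.correlate(ref_sig, sig, mode="full")
        lag = np.argmax(corr) - (n - 1)  # positive lag: sig leads the reference
        aligned = np.roll(sig, lag)      # crude circular shift; real systems window this
        out += aligned
    return out / len(channels)
```

Aligning before summing is what distinguishes delay-and-sum from naive channel averaging: in-phase speech adds coherently while uncorrelated room noise partially cancels, which is why the technique helps distant-microphone recognition.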