The AMI meeting transcription system: progress and performance
MLMI'06 Proceedings of the Third international conference on Machine Learning for Multimodal Interaction
In this paper we describe the 2005 AMI system for the transcription of speech in meetings, as used in the 2005 NIST RT evaluations. The system was designed for participation in the speech-to-text part of the evaluations, in particular for the transcription of speech recorded with multiple distant microphones and individual headset microphones. System performance was tested on both conference-room and lecture-style meetings. Although the input sources are processed using different front-ends, the recognition process is based on a unified system architecture. The system operates in multiple passes and makes use of state-of-the-art techniques such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression, and minimum word error rate decoding. We describe system performance on the official development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than ten months by a multi-site team and was shown to achieve competitive performance.
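The multi-pass architecture the abstract mentions can be sketched as follows: a first speaker-independent decoding pass produces hypotheses, per-speaker normalisation and adaptation statistics (e.g. a VTLN warp factor and an MLLR transform) are estimated against those hypotheses, and a second pass re-decodes with the adapted models. This is a minimal illustrative skeleton only; every function and class name here is a hypothetical placeholder, not the AMI system's actual code, and the real passes involve HMM/WFST decoding and maximum-likelihood transform estimation.

```python
from dataclasses import dataclass

@dataclass
class SpeakerState:
    """Per-speaker normalisation/adaptation parameters (placeholders)."""
    vtln_warp: float = 1.0            # vocal tract length warp factor
    mllr_transform: str = "identity"  # stand-in for an MLLR regression matrix

def decode(audio: str, state: SpeakerState) -> str:
    # Placeholder decoder: a real system runs an HMM/WFST search here,
    # using features warped by state.vtln_warp and models transformed
    # by state.mllr_transform.
    return f"hyp({audio},warp={state.vtln_warp:.2f},{state.mllr_transform})"

def estimate_adaptation(audio: str, hypothesis: str) -> SpeakerState:
    # Placeholder: a real system searches for the maximum-likelihood
    # warp factor and estimates MLLR mean transforms against the
    # first-pass hypothesis. Fixed values are used here for illustration.
    return SpeakerState(vtln_warp=0.95, mllr_transform="mllr#1")

def transcribe(audio: str) -> str:
    state = SpeakerState()                 # pass 1: speaker-independent
    first_pass = decode(audio, state)
    state = estimate_adaptation(audio, first_pass)
    return decode(audio, state)            # pass 2: normalised + adapted

print(transcribe("meeting_seg_01"))
```

The key design point the sketch illustrates is that both passes share one decoding routine: only the per-speaker state changes between passes, matching the abstract's point that different input sources and passes run through a unified system architecture.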