The 2005 AMI system for the transcription of speech in meetings
MLMI'05 Proceedings of the Second international conference on Machine Learning for Multimodal Interaction
The automatic processing of speech collected in conference-style meetings has attracted considerable interest, with several large-scale projects devoted to this area. This paper describes the development of a baseline automatic speech transcription system for meetings in the context of the AMI (Augmented Multiparty Interaction) project. We present several techniques important to the processing of this data and report performance in terms of word error rates (WERs). An important aspect of transcribing this data is the flexibility required in audio pre-processing: real-world systems must handle varied input, for example from microphone arrays or randomly placed microphones in a room. Automatic segmentation and microphone array processing techniques are described and their effect on WERs is discussed. The system and its components presented in this paper yield competitive performance and form a baseline for future research in this domain.
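Since results are reported as word error rates, a minimal sketch of the standard WER computation may be useful: WER is the word-level edit distance (substitutions + deletions + insertions) between reference and hypothesis, normalized by the reference length. The function name below is illustrative, not taken from the paper or from any scoring toolkit.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (S + D + I) / N, computed via dynamic-programming
    edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # match/substitution
    return d[len(r)][len(h)] / len(r)
```

For example, against the reference "a b c d", the hypothesis "a x c" has one substitution and one deletion, giving a WER of 0.5. Production systems use dedicated scoring tools that also handle case, compound words, and optional deletions, but the core alignment is exactly this recurrence.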
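For the microphone array processing mentioned above, the simplest enhancement technique is delay-and-sum beamforming: each channel is time-aligned by its estimated arrival delay and the aligned signals are averaged. The sketch below assumes integer sample delays have already been estimated (e.g. by cross-correlation); it is an illustration of the general technique, not the system's actual implementation.

```python
def delay_and_sum(channels, delays):
    """Time-align each microphone channel by its arrival delay
    (in samples) and average, reinforcing the target speaker while
    attenuating uncorrelated noise. `channels` is a list of
    equal-length sample lists; `delays` gives each channel's delay
    relative to the earliest-arriving channel."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        for t in range(n):
            if 0 <= t + d < n:        # advance later-arriving channels
                out[t] += ch[t + d]
    return [v / len(channels) for v in out]
```

With correct delays, in-phase speech adds coherently while diffuse noise averages toward zero; with the randomly placed microphones discussed above, the delay estimation step itself becomes the hard part.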