The IBM RT07 Evaluation Systems for Speaker Diarization on Lecture Meetings. Multimodal Technologies for Perception of Humans.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Computer Speech and Language
Unsupervised model adaptation using information-theoretic criterion. HLT '10: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Advances in Mandarin broadcast speech transcription at IBM under the DARPA GALE program. ISCSLP '06: Proceedings of the 5th International Conference on Chinese Spoken Language Processing.
Syntactic decision tree LMs: random selection or intelligent design? EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Implicitly intersecting weighted automata using dual decomposition. NAACL HLT '12: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Fast syntactic analysis for statistical language modeling via substructure sharing and uptraining. ACL '12: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1.
Revisiting the case for explicit syntactic information in language models. WLM '12: Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT.
Direct construction of compact context-dependency transducers from data. Computer Speech and Language.
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
This paper describes the technical and system-building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system-building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 hours of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation, and produced the most accurate results on the 2004 test sets in every speed category.
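As a quick check on the headline number, the relative error-rate reduction can be computed directly from the two word-error rates quoted in the abstract. The sketch below uses only those reported figures (20.4% and 16.1%); the function name is illustrative, not from the paper.

```python
# Relative reduction in word error rate (WER), using the abstract's figures:
# the 2003 test-set WER fell from 20.4% (2003 system) to 16.1% (2004 system).
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline

rel = relative_reduction(20.4, 16.1)
print(f"{rel:.1f}% relative")  # about 21% relative, matching the reported figure
```

Note that "21% relative" measures the drop as a fraction of the baseline rate, not the 4.3-point absolute difference between the two error rates.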