Soft margin estimation on improving environment structures for ensemble speaker and speaking environment modeling

Authors:
Yu Tsao; Jinyu Li; Chin-Hui Lee; Satoshi Nakamura
Affiliations:
National Institute of Information and Communications Technology, Kyoto, Japan; Microsoft, Redmond, WA; Georgia Institute of Technology, Atlanta, GA; National Institute of Information and Communications Technology, Kyoto, Japan
Venue:
Proceedings of the 3rd International Universal Communication Symposium
Year:
2009

Abstract

Recently, we proposed an ensemble speaker and speaking environment modeling (ESSEM) approach to enhance the robustness of automatic speech recognition (ASR) under adverse conditions. The ESSEM framework comprises two phases: offline and online. In the offline phase, we prepare an environment structure formed by multiple sets of hidden Markov models (HMMs), each of which represents a particular speaker and speaking environment. In the online phase, ESSEM estimates a mapping function that transforms the prepared environment structure into a set of HMMs for the unknown testing condition. In this study, we incorporate soft margin estimation (SME) to increase the discriminative power of the environment structure in the offline phase and thereby enhance the overall ESSEM performance. We evaluated the performance on the Aurora-2 connected digit database. With the SME-refined environment structure, ESSEM provides better performance than the original framework. Using our best online mapping function, ESSEM achieves a word error rate (WER) of 4.62%, a 14.60% relative WER reduction over the best baseline WER of 5.41%.
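
The relative WER reduction quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function name `relative_wer_reduction` is a hypothetical helper, and it assumes the usual definition of relative reduction, (baseline - new) / baseline, expressed in percent.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Hypothetical helper (not from the paper). Assumes the usual definition:
    100 * (baseline - new) / baseline, with both WERs given in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER figures reported in the abstract: best baseline 5.41%, SME-refined ESSEM 4.62%.
reduction = relative_wer_reduction(5.41, 4.62)
print(f"{reduction:.2f}%")  # prints "14.60%"
```

The absolute improvement is 0.79 percentage points; dividing by the 5.41% baseline gives the 14.60% relative figure.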