A MAP-based Online Estimation Approach to Ensemble Speaker and Speaking Environment Modeling

Authors:
Yu Tsao; Shigeki Matsuda; Chiori Hori; Hideki Kashioka; Chin-Hui Lee
Affiliations:
Res. Center for Inf. Technol. Innovation (CITI), Acad. Sinica, Taipei, Taiwan; Spoken Language Commun. Lab., Nat. Inst. of Inf. & Commun. Technol. (NICT), Kyoto, Japan; Spoken Language Commun. Lab., Nat. Inst. of Inf. & Commun. Technol. (NICT), Kyoto, Japan; Spoken Language Commun. Lab., Nat. Inst. of Inf. & Commun. Technol. (NICT), Kyoto, Japan; Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA, USA
Venue:
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
Year:
2014

Abstract

An ensemble speaker and speaking environment modeling (ESSEM) approach was recently developed. The ESSEM process consists of an offline phase and an online phase. The offline phase establishes an environment structure using speech data collected under a wide range of acoustic conditions, whereas the online phase uses that structure to estimate a set of acoustic models matched to the testing environment. Because the estimated acoustic models accurately characterize the particular testing conditions, ESSEM can improve speech recognition performance under adverse conditions. In this work, we propose two maximum a posteriori (MAP) based algorithms to improve the online estimation part of the original ESSEM framework. We first develop MAP-based environment structure adaptation to refine the original environment structure. Next, we use the MAP criterion to estimate the mapping function of ESSEM and thereby enhance its environment modeling capability. For the MAP estimation, three types of prior densities are derived: the clustered prior (CP), the sequential prior (SP), and the hierarchical prior (HP). Since each prior density characterizes specific acoustic knowledge, we further derive a combination mechanism to integrate the three priors. Experimental results on the Aurora-2 task verify that MAP-based online mapping function estimation enables ESSEM to achieve better performance than its maximum-likelihood (ML) based counterpart. Moreover, integrating online environment structure adaptation with mapping function estimation gives the proposed MAP-based ESSEM framework its best performance. Compared with our baseline results, MAP-based ESSEM achieves an average relative word error rate reduction of 15.53% (from 5.41% to 4.57%) under 50 testing conditions at signal-to-noise ratios (SNRs) of 0 to 20 dB over the three standardized testing sets.
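
To illustrate why a MAP criterion can help when little online adaptation data is available, the following minimal sketch contrasts ML and MAP estimates of a single Gaussian mean. It is a generic textbook case and not the paper's mapping function estimation: the scalar model, the single conjugate Gaussian prior (standing in for the CP, SP, and HP densities), and the names `ml_mean`, `map_mean`, and `tau` are all assumptions made for illustration.

```python
def ml_mean(samples):
    """Maximum-likelihood estimate of a Gaussian mean: the sample average."""
    return sum(samples) / len(samples)


def map_mean(samples, prior_mean, tau):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior.

    tau (hypothetical name) is the prior weight, expressed as an
    equivalent number of observations: tau = 0 ignores the prior,
    while a large tau keeps the estimate close to prior_mean.
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)


# Toy data: two observations standing in for scarce adaptation data.
samples = [2.0, 4.0]
prior_mean = 0.0  # assumed prior knowledge, e.g. from offline training
tau = 2.0         # assumed prior weight

print("ML estimate: ", ml_mean(samples))
print("MAP estimate:", map_mean(samples, prior_mean, tau))
```

With only two observations the MAP estimate is pulled toward the prior mean, and as more data arrives it converges to the ML estimate. Setting `tau` to zero recovers the ML estimate exactly.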