The paper considers the problem of audio-visual speech recognition in a simultaneous (target/masker) speaker environment. It follows a conventional multistream approach and examines the specific problem of estimating reliable time-varying audio and visual stream weights. The task is challenging because, in the two-speaker condition, the signal-to-noise ratio (SNR), and hence the audio stream weight, cannot always be reliably inferred from the acoustics alone: similarity between the target and masker sound sources can cause the foreground and background to be confused. The paper presents a novel solution that combines audio and visual information to estimate the acoustic SNR. The method employs artificial neural networks to estimate the SNR from hidden Markov model (HMM) state likelihoods calculated using separate audio and visual streams. The SNR estimates are then mapped to either constant utterance-level (global) stream weights or time-varying frame-based (local) stream weights. The system has been evaluated using either gender-dependent models that are specific to the target speaker, or gender-independent models that discriminate poorly between target and masker. When using known SNR, the time-varying stream weight system outperforms the constant stream weight systems at all SNRs tested. It is thought that the time-varying weights allow the automatic speech recognition system to exploit regions where the local SNR is temporarily high despite the global SNR being low. When using estimated SNR, the time-varying system outperformed the constant stream weight system at SNRs of 0 dB and above. Systems using stream weights estimated from both audio and video information performed better than those using stream weights estimated from the audio stream alone, particularly in the gender-independent case. However, when mixtures are at a global SNR below 0 dB, the stream weights are not estimated well enough to produce good performance. Methods for improving the SNR estimation are discussed. The paper also relates the use of visual information in the current system to its role in recent simultaneous-speaker intelligibility studies, where, as well as providing phonetic content, it triggers 'informational masking release', helping the listener to attend selectively to the target speech stream.
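The weighted multistream combination described in the abstract can be sketched compactly. The following is a minimal illustration under stated assumptions, not the authors' implementation: it assumes per-frame HMM state log-likelihoods are already available for the audio and visual streams, and it uses a hypothetical sigmoid mapping (snr_to_weight, with illustrative midpoint and slope parameters) from an estimated local SNR to an audio stream weight in [0, 1].

```python
import numpy as np

def snr_to_weight(snr_db, midpoint=0.0, slope=0.3):
    """Map an estimated local SNR (in dB) to an audio stream weight in [0, 1].

    Hypothetical sigmoid mapping: high SNR -> trust the audio stream,
    low SNR -> shift weight toward the visual stream. The midpoint and
    slope values here are illustrative, not taken from the paper.
    """
    snr_db = np.asarray(snr_db, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))

def combined_log_likelihood(log_lik_audio, log_lik_video, snr_db):
    """Frame-level multistream combination with time-varying weights.

    log_lik_audio, log_lik_video: (T, S) arrays of per-frame HMM state
        log-likelihoods for the audio and visual streams.
    snr_db: (T,) array of estimated local SNRs, one per frame.
    Returns a (T, S) array of weighted log-likelihoods, formed as
    lambda_t * logL_audio + (1 - lambda_t) * logL_video for each frame t.
    """
    lam = snr_to_weight(snr_db)[:, None]  # (T, 1) audio weights, broadcast over states
    return lam * log_lik_audio + (1.0 - lam) * log_lik_video

# Toy usage: 4 frames, 3 HMM states, per-frame SNR estimates in dB.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 3))
    video = rng.normal(size=(4, 3))
    snr = np.array([-6.0, 0.0, 5.0, 12.0])
    print(combined_log_likelihood(audio, video, snr))
```

A constant utterance-level (global) weight, as in the paper's baseline systems, corresponds to replacing the per-frame weights with a single value derived from one utterance-level SNR estimate; the time-varying scheme generalizes this by letting the weight follow momentary rises in local SNR even when the global SNR is low.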