Detection and separation of speech event using audio and video information fusion and its application to robust speech interface

Authors:
Futoshi Asano;Kiyoshi Yamamoto;Isao Hara;Jun Ogata;Takashi Yoshimura;Yoichi Motomura;Naoyuki Ichimura;Hideki Asoh
Affiliations:
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan;Department of Computer Science, Tsukuba University, Tsukuba, Japan;Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan;Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan;Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan;Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan;Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan;Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan
Venue:
EURASIP Journal on Applied Signal Processing
Year:
2004

Abstract

A method for detecting speech events in the presence of multiple sound sources using audio and video information is proposed. To detect speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. The inference results of the Bayesian network provide the time and location of each speech event. This information is then used in a robust speech interface. A maximum likelihood adaptive beamformer is employed as a preprocessor for the speech recognizer to separate the speech signal from environmental noise. The beamformer coefficients are continually updated based on the detected speech events. The speech recognizer also uses this information to extract the speech segment.
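
To make the fusion idea concrete, the following is a minimal sketch of how an audio cue and a video cue might be combined by Bayes' rule to decide whether a speech event is present at a given location. It is not the network described in the paper: the two-cue naive Bayes structure, the function name `speech_posterior`, and all probability values are assumptions chosen purely for illustration.

```python
def speech_posterior(audio_hit, video_hit, prior=0.3,
                     p_audio=(0.2, 0.9), p_video=(0.3, 0.95)):
    """Return P(speech | audio cue, video cue) for one candidate location.

    Hypothetical model: the two cues are assumed conditionally independent
    given the speech state (a naive Bayes structure, not the paper's network).
    p_audio = (P(audio hit | no speech), P(audio hit | speech))
    p_video = (P(video hit | no speech), P(video hit | speech))
    All numeric defaults are illustrative assumptions.
    """
    def likelihood(hit, p_hit):
        # Probability of the observed cue value given P(hit) for that state.
        return p_hit if hit else 1.0 - p_hit

    joint_speech = (prior
                    * likelihood(audio_hit, p_audio[1])
                    * likelihood(video_hit, p_video[1]))
    joint_silent = ((1.0 - prior)
                    * likelihood(audio_hit, p_audio[0])
                    * likelihood(video_hit, p_video[0]))
    # Normalize over the two states of the speech variable.
    return joint_speech / (joint_speech + joint_silent)


if __name__ == "__main__":
    for a in (True, False):
        for v in (True, False):
            print(f"audio={a!s:5} video={v!s:5} "
                  f"P(speech)={speech_posterior(a, v):.3f}")
```

Under these assumed values, a sound source detected where no person is seen receives a much lower posterior than one that coincides with a tracked person, which is the kind of disambiguation that combining the two modalities is meant to provide.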