Detection of a speaker in video by combined analysis of speech sound and mouth movement

  • Authors:
  • Osamu Ikeda

  • Affiliations:
  • Faculty of Engineering, Takushoku University, Hachioji, Tokyo, Japan

  • Venue:
  • ISVC'07: Proceedings of the 3rd International Conference on Advances in Visual Computing - Volume Part II
  • Year:
  • 2007

Abstract

We present a robust method for detecting and locating a speaker through joint analysis of speech sound and video images. First, a short segment of the speech sound is analyzed to estimate the rate of spoken syllables, and a difference image is formed using the optimal frame distance derived from that rate to detect mouth candidates. The candidates are then tracked to verify that one of them is the mouth: the rate of mouth movement is estimated from the brightness change profile of the first candidate and, if the two rates agree, the three brightest regions in the resulting difference image are detected as the mouth and eyes. If not, the second candidate is tracked, and so on. The first-order moment of the power spectrum of the brightness change profile and the lateral shifts observed during tracking are also used to check whether the candidates are facial parts.
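
The abstract gives no implementation details, so the following Python sketch is only a rough illustration of the ideas it describes, not the authors' method. The frame-distance heuristic (half a syllable period), the helper names (frame_distance, difference_image, brightness_profile, movement_rate), and the candidate-box format are all assumptions made for this sketch.

    import numpy as np

    def frame_distance(syllable_rate_hz, fps):
        # Assumed heuristic: pick a frame distance of about half the
        # syllable period, so the two frames straddle a mouth
        # open/close transition (one syllable ~ one open/close cycle).
        return max(1, int(round(fps / (2.0 * syllable_rate_hz))))

    def difference_image(frames, d):
        # Accumulate absolute frame differences at distance d; moving
        # regions such as the mouth appear bright in the result.
        acc = np.zeros_like(frames[0], dtype=np.float64)
        for t in range(len(frames) - d):
            acc += np.abs(frames[t + d].astype(np.float64)
                          - frames[t].astype(np.float64))
        return acc / (len(frames) - d)

    def brightness_profile(frames, box):
        # Mean brightness of a candidate region (y0, y1, x0, x1)
        # in each frame, giving a brightness change profile over time.
        y0, y1, x0, x1 = box
        return np.array([f[y0:y1, x0:x1].mean() for f in frames])

    def movement_rate(profile, fps):
        # Estimate the dominant movement rate (Hz) as the first-order
        # moment of the power spectrum of the brightness change profile.
        change = np.diff(profile)
        spec = np.abs(np.fft.rfft(change - change.mean())) ** 2
        freqs = np.fft.rfftfreq(len(change), d=1.0 / fps)
        return (freqs * spec).sum() / spec.sum()

A candidate would then be accepted when its estimated movement rate matches the syllable rate from the audio to within some tolerance, e.g. abs(movement_rate(p, fps) - syllable_rate_hz) < tol; the tolerance is likewise an assumed parameter not specified in the abstract.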