Cross-modal prediction in audio-visual communication

Authors:
R. R. Rao;Tsuhan Chen
Affiliations:
Georgia Inst. of Technol., Atlanta, GA, USA;-
Venue:
ICASSP '96 Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 04
Year:
1996

Abstract

We present a novel means for predicting the shape of a person's mouth from the corresponding speech signal and explore applications of this prediction to video coding. The prediction is accomplished by modeling the probability distribution of the audiovisual features by a Gaussian mixture density. The optimal estimate for the visual features given the acoustic features can then be computed using this probability distribution. The ability to predict a person's mouth shape from the corresponding audio leads to a number of interesting joint audio-video coding strategies. In the cross-modal predictive coding system described, a model-based video coder compares measured visual parameters with predicted visual parameters, and sends the difference between the two to the receiver. Since the decoder also receives the acoustic data, it can form the prediction and then reconstruct the original parameters by adding the transmitted error signal.
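The prediction and coding steps described in the abstract can be illustrated with a minimal sketch. It assumes that "optimal" means the minimum mean-squared-error estimate, i.e. the conditional expectation of the visual features given the acoustic features under the joint Gaussian mixture. It also assumes each feature vector is stacked as [acoustic; visual]. The function names, the parameter layout, and the toy model values are hypothetical and are not taken from the paper.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(mean, cov) evaluated at x."""
    diff = x - mean
    dim = x.shape[0]
    norm = np.sqrt(((2.0 * np.pi) ** dim) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)


def predict_visual(a, weights, means, covs, n_a):
    """Conditional mean E[v | a] under a joint Gaussian mixture.

    Assumed layout (hypothetical): the first n_a entries of each joint
    vector are acoustic features, the remaining entries are visual.
    weights: (K,), means: (K, D), covs: (K, D, D).
    """
    # Posterior probability of each mixture component given the audio only.
    resp = np.array([
        w * gaussian_pdf(a, m[:n_a], c[:n_a, :n_a])
        for w, m, c in zip(weights, means, covs)
    ])
    resp = resp / resp.sum()

    estimate = np.zeros(means.shape[1] - n_a)
    for h, m, c in zip(resp, means, covs):
        c_aa = c[:n_a, :n_a]   # acoustic covariance block
        c_va = c[n_a:, :n_a]   # visual-acoustic cross-covariance block
        # Per-component linear regression of visual on acoustic features.
        estimate += h * (m[n_a:] + c_va @ np.linalg.solve(c_aa, a - m[:n_a]))
    return estimate


def encode_residual(v_measured, a, model):
    """Encoder side: transmit only the prediction error."""
    return v_measured - predict_visual(a, *model)


def decode_visual(residual, a, model):
    """Decoder side: rebuild the prediction from audio, add the error."""
    return predict_visual(a, *model) + residual


if __name__ == "__main__":
    # Toy two-component model with 1 acoustic and 1 visual feature
    # (illustrative values only, not fitted to any data).
    toy_model = (
        np.array([0.5, 0.5]),
        np.array([[0.0, 0.0], [3.0, 2.0]]),
        np.array([[[1.0, 0.5], [0.5, 1.0]],
                  [[1.0, -0.3], [-0.3, 1.0]]]),
        1,
    )
    audio = np.array([0.4])
    mouth = np.array([0.9])
    error = encode_residual(mouth, audio, toy_model)
    print(decode_visual(error, audio, toy_model))
```

Because the encoder and decoder form the same prediction from the same acoustic data, the decoder recovers the measured parameters exactly when the residual is sent without quantization. A practical coder would quantize the residual, which this sketch leaves out.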