One word says more than a thousand pictures
Computers and Artificial Intelligence
The interaction of image and speech processing is a crucial property of multimedia systems. Classical systems, which reason over purely qualitative high-level descriptions, lose a great deal of information when confronted with erroneous, vague, or incomplete data. We propose a new architecture that integrates several levels of processing by maintaining multiple representations of the visually observed scene. These representations are vertically connected by Bayesian networks in order to find the most plausible interpretation of the scene. The interpretation of a spoken utterance naming an object in the scene is modeled as a further partial representation of that scene. Under this concept, the key problem is identifying the verbally specified object instances in the visually observed scene. To solve it, a Bayesian network is generated dynamically from the spoken utterance and the visual scene representation. This network encodes spatial knowledge as well as knowledge extracted from psycholinguistic experiments. First results demonstrate the robustness of our approach.
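The core idea — scoring each visually detected object against the attributes named in an utterance and picking the most plausible referent — can be sketched as follows. This is a minimal illustration under naive conditional-independence assumptions, not the authors' implementation; the scene objects, attribute distributions, and function names are invented for the example, and the dynamically generated network is reduced here to a single Bayesian update.

```python
# Sketch: Bayesian identification of a verbally named object in a scene.
# All numbers below are hypothetical vision-module confidences.

# Visual scene representation: each detected object carries uncertain
# attribute estimates, e.g. P(color | image) and P(type | image).
scene = {
    "obj1": {"color": {"red": 0.8, "yellow": 0.2}, "type": {"cube": 0.9, "bar": 0.1}},
    "obj2": {"color": {"red": 0.3, "yellow": 0.7}, "type": {"cube": 0.2, "bar": 0.8}},
    "obj3": {"color": {"red": 0.6, "yellow": 0.4}, "type": {"cube": 0.5, "bar": 0.5}},
}

def plausibility(obj, utterance):
    """P(utterance | object), assuming each spoken attribute word is
    explained independently by the matching visual attribute distribution."""
    p = 1.0
    for attribute, word in utterance.items():
        p *= obj[attribute].get(word, 0.0)
    return p

def identify(scene, utterance, prior=None):
    """Posterior over candidate objects given the utterance
    (uniform prior over objects by default)."""
    prior = prior or {name: 1.0 / len(scene) for name in scene}
    joint = {name: prior[name] * plausibility(obj, utterance)
             for name, obj in scene.items()}
    z = sum(joint.values())
    return {name: p / z for name, p in joint.items()}

# Evidence extracted from the utterance "the red cube":
posterior = identify(scene, {"color": "red", "type": "cube"})
best = max(posterior, key=posterior.get)  # most plausible referent
```

In a fuller version, the prior would itself come from the scene representation (e.g. spatial relations such as "left of the bar" reweighting the candidates), which is where the dynamically generated network and the psycholinguistically derived knowledge described above would enter.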