One word says more than a thousand pictures
Computers and Artificial Intelligence
The interaction of image and speech processing is a crucial property of multimedia systems. Classical systems, which reason over purely qualitative high-level descriptions, lose a great deal of information when confronted with erroneous, vague, or incomplete data. We propose a new architecture that integrates several levels of processing by maintaining multiple representations of the visually observed scene. These representations are vertically connected by Bayesian networks in order to find the most plausible interpretation of the scene. The interpretation of a spoken utterance naming an object in the scene is modeled as a further partial representation of that scene. Under this concept, the key problem is identifying the verbally specified object instances in the visually observed scene. To solve it, a Bayesian network is generated dynamically from the spoken utterance and the visual scene representation. This network encodes spatial knowledge as well as knowledge extracted from psycholinguistic experiments. First results demonstrate the robustness of our approach.
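The core idea — scoring each visually detected object against the attributes named in an utterance and picking the most plausible referent — can be sketched as follows. This is a minimal illustration under naive conditional-independence assumptions, not the authors' implementation; the scene objects, attribute distributions, and function names are invented for the example, and the dynamically generated network is reduced here to a single Bayesian update.

```python
# Sketch: Bayesian identification of a verbally named object in a scene.
# All numbers below are hypothetical vision-module confidences.

# Visual scene representation: each detected object carries uncertain
# attribute estimates, e.g. P(color | image) and P(type | image).
scene = {
    "obj1": {"color": {"red": 0.8, "yellow": 0.2}, "type": {"cube": 0.9, "bar": 0.1}},
    "obj2": {"color": {"red": 0.3, "yellow": 0.7}, "type": {"cube": 0.2, "bar": 0.8}},
    "obj3": {"color": {"red": 0.6, "yellow": 0.4}, "type": {"cube": 0.5, "bar": 0.5}},
}

def plausibility(obj, utterance):
    """P(utterance | object), assuming each spoken attribute word is
    explained independently by the matching visual attribute distribution."""
    p = 1.0
    for attribute, word in utterance.items():
        p *= obj[attribute].get(word, 0.0)
    return p

def identify(scene, utterance, prior=None):
    """Posterior over candidate objects given the utterance
    (uniform prior over objects by default)."""
    prior = prior or {name: 1.0 / len(scene) for name in scene}
    joint = {name: prior[name] * plausibility(obj, utterance)
             for name, obj in scene.items()}
    z = sum(joint.values())
    return {name: p / z for name, p in joint.items()}

# Evidence extracted from the utterance "the red cube":
posterior = identify(scene, {"color": "red", "type": "cube"})
best = max(posterior, key=posterior.get)  # most plausible referent
```

In a fuller version, the prior would itself come from the scene representation (e.g. spatial relations such as "left of the bar" reweighting the candidates), which is where the dynamically generated network and the psycholinguistically derived knowledge described above would enter.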