Conventional methods for multimodal data retrieval use text-tag-based or cross-modal approaches such as tag-image co-occurrence and canonical correlation analysis. However, because text and image features differ in granularity, approaches based on lower-order relationships between modalities may be limited. Here, we propose a novel text and image keyword generation method based on cross-modal associative learning and inference with multimodal queries. We use a modified hypernetwork model, the layered hypernetwork (LHN), which consists of a first (lower) layer containing two or more modality-dependent hypernetworks and a second (upper) layer containing one modality-integrating hypernetwork. LHNs learn higher-order associative relationships between the text and image modalities by training on an example set. After training, an LHN extends multimodal queries by generating text and image keywords via cross-modal inference, i.e., text-to-image and image-to-text. The LHN is evaluated on Korean magazine articles with images about women's fashion and lifestyle. Experimental results show that the proposed method generates vision-language cross-modal keywords with high accuracy. The results also show that multimodal queries improve the accuracy of keyword generation compared with unimodal ones.
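To make the two-layer structure and the cross-modal query extension concrete, the sketch below shows one possible organization, assuming bag-of-words text keywords and quantized visual words as image features. The class names (Hypernetwork, LayeredHypernetwork), the hyperedge sizes, the sampling scheme, and the voting-based inference are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a layered hypernetwork (LHN) for cross-modal keyword
# generation. All parameters and helper names here are hypothetical.
import random
from collections import Counter


class Hypernetwork:
    """Stores fixed-size hyperedges sampled from training examples."""

    def __init__(self, edge_size=3, edges_per_example=20, seed=0):
        self.edge_size = edge_size
        self.edges_per_example = edges_per_example
        self.rng = random.Random(seed)
        self.edges = []  # list of (frozenset(features), payload)

    def train(self, examples):
        # Each example is (features, payload); the payload carries the
        # features of the other modality (or both, for the integrating layer).
        for features, payload in examples:
            feats = list(features)
            if len(feats) < self.edge_size:
                continue
            for _ in range(self.edges_per_example):
                edge = frozenset(self.rng.sample(feats, self.edge_size))
                self.edges.append((edge, payload))

    def infer(self, query_features, top_k=5):
        # Higher-order (k-wise) matching: every hyperedge fully covered by
        # the query votes for its associated payload keywords.
        query = set(query_features)
        votes = Counter()
        for edge, payload in self.edges:
            if edge <= query:
                votes.update(payload)
        return [f for f, _ in votes.most_common(top_k)]


class LayeredHypernetwork:
    """Lower layer: one hypernetwork per modality (text, image).
    Upper layer: one modality-integrating hypernetwork over combined features."""

    def __init__(self):
        self.text_hn = Hypernetwork(edge_size=2)   # text -> image keywords
        self.image_hn = Hypernetwork(edge_size=2)  # image -> text keywords
        self.joint_hn = Hypernetwork(edge_size=3)  # integrated keywords

    def train(self, corpus):
        # corpus: list of (text_words, image_words) pairs, e.g. article
        # keywords paired with the visual words of the article's images.
        self.text_hn.train([(t, i) for t, i in corpus])
        self.image_hn.train([(i, t) for t, i in corpus])
        self.joint_hn.train([(set(t) | set(i), set(t) | set(i)) for t, i in corpus])

    def extend_query(self, text_query=(), image_query=(), top_k=5):
        # Multimodal query extension: cross-modal keywords from the lower
        # layer plus integrated keywords from the upper layer.
        out = set(self.text_hn.infer(text_query, top_k))
        out |= set(self.image_hn.infer(image_query, top_k))
        out |= set(self.joint_hn.infer(set(text_query) | set(image_query), top_k))
        return out - set(text_query) - set(image_query)


if __name__ == "__main__":
    corpus = [
        ({"dress", "summer", "linen"}, {"vw_12", "vw_40", "vw_7"}),
        ({"coat", "winter", "wool"}, {"vw_3", "vw_40", "vw_21"}),
    ]
    lhn = LayeredHypernetwork()
    lhn.train(corpus)
    print(lhn.extend_query(text_query={"dress", "summer"}, image_query={"vw_12", "vw_7"}))
```

In this sketch, supplying both a text and an image query lets all three hypernetworks contribute votes, which mirrors the paper's observation that multimodal queries yield more accurate keyword generation than unimodal ones.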