Multimodal recognition of visual concepts using histograms of textual concepts and selective weighted late fusion scheme

  • Authors:
  • Ningning Liu, Emmanuel Dellandréa, Liming Chen, Chao Zhu, Yu Zhang, Charles-Edmond Bichot, Stéphane Bres, Bruno Tellez

  • Affiliations:
  • Université de Lyon, CNRS, France and Ecole Centrale de Lyon, LIRIS, UMR 5205, F-69134, France (N. Liu, E. Dellandréa, L. Chen, C. Zhu, Y. Zhang, C.-E. Bichot); Université de Lyon, CNRS, France and INSA-Lyon, LIRIS, UMR 5205, F-69621, France (S. Bres); Université de Lyon, CNRS, France and Université Lyon 1, LIRIS, UMR 5205, F-69622, France (B. Tellez)

  • Venue:
  • Computer Vision and Image Understanding
  • Year:
  • 2013

Abstract

The text associated with images provides valuable semantic meaning about image content that can hardly be described by low-level visual features. In this paper, we propose a novel multimodal approach to automatically predict the visual concepts of images through an effective fusion of textual features with visual ones. In contrast to the classical Bag-of-Words approach, which relies only on term frequencies, we propose a novel textual descriptor, the Histogram of Textual Concepts (HTC), which accounts for the relatedness of semantic concepts when accumulating the contributions of words from the image caption toward a dictionary. In addition to the popular SIFT-like features, we also evaluate a set of mid-level visual features, aiming at characterizing the harmony, dynamism, and aesthetic quality of visual content in relation to affective concepts. Finally, a novel selective weighted late fusion (SWLF) scheme is proposed to automatically select and weight the scores of the best features according to the concept to be classified. This scheme proves particularly useful for image annotation in a multi-label scenario. Extensive experiments were carried out on the MIR FLICKR image collection within the ImageCLEF 2011 photo annotation challenge. Our best model, a late fusion of textual and visual features, achieved a MiAP (Mean interpolated Average Precision) of 43.69% and ranked 2nd out of 79 runs. We also provide a comprehensive analysis of the experimental results and give some insights for future improvements.
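The core idea behind HTC, as the abstract describes it, is that each caption word contributes to every dictionary concept in proportion to their semantic relatedness, rather than incrementing only the bin of an exact term match as in Bag-of-Words. A minimal sketch of this accumulation step is shown below; the `toy_relatedness` function and the L1 normalization are illustrative assumptions, not the paper's actual relatedness measure or normalization.

```python
def htc(caption_words, dictionary, relatedness):
    """Histogram of Textual Concepts (sketch): accumulate each caption
    word's relatedness to every dictionary concept into one bin per concept."""
    hist = [0.0] * len(dictionary)
    for word in caption_words:
        for i, concept in enumerate(dictionary):
            hist[i] += relatedness(word, concept)
    # L1-normalize the histogram (a hypothetical choice for this sketch;
    # the paper's exact normalization may differ).
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def toy_relatedness(word, concept):
    """Stand-in relatedness: 1.0 on exact match, 0.5 on substring overlap,
    0 otherwise. A real system would use a semantic measure instead."""
    if word == concept:
        return 1.0
    if word in concept or concept in word:
        return 0.5
    return 0.0

concepts = ["dog", "cat", "sky"]
print(htc(["dog", "doghouse", "sky"], concepts, toy_relatedness))
# → [0.6, 0.0, 0.4]
```

Note how "doghouse" still contributes 0.5 to the "dog" bin even though it never appears in the dictionary, which is exactly the behavior that plain term-frequency counting misses.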