Snap-and-ask: answering multimodal question by naming visual instance

Authors:
Wei Zhang;Lei Pang;Chong-Wah Ngo
Affiliations:
City University of Hong Kong, Hong Kong, Hong Kong;City University of Hong Kong, Hong Kong, Hong Kong;City University of Hong Kong, Hong Kong, Hong Kong
Venue:
Proceedings of the 20th ACM international conference on Multimedia
Year:
2012



Abstract

In real life, it is easier to provide a visual cue when asking a question about a possibly unfamiliar topic, for example, "Where was this crop circle found?". Providing an image of the instance is far more convenient than typing a verbose description of its visual properties, especially when the name of the query instance is not known. Nevertheless, the need to identify the visual instance before processing the question and returning an answer makes multimodal question answering technically challenging. This paper addresses the problem of visual-to-text naming through the paradigm of answering-by-search in a two-stage computational framework composed of instance search (IS) and similar question ranking (QR). In IS, the names of instances are inferred from similar visual examples retrieved from a million-scale image dataset. To recall instances with non-planar and non-rigid shapes, spatial configurations that emphasize topological consistency while allowing for local variations in matches are incorporated. In QR, candidate names of the instance are statistically identified from the search results and used directly to retrieve similar questions from community-contributed QA (cQA) archives. By parsing questions into syntactic trees, fuzzy matching between the inquirer's question and cQA questions is performed to locate answers and recommend related questions to the inquirer. The proposed framework is evaluated on a wide range of visual instances (e.g., fashion, art, food, pet, logo, and landmark) over various QA categories (e.g., factoid, definition, how-to, and opinion).
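
The two-stage flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes the instance search stage has already returned visually similar images with a similarity score and attached names, it uses similarity-weighted voting as one possible reading of "statistically identified", and it substitutes a simple Jaccard token overlap for the syntactic tree matching. The function names `vote_names`, `tokens`, and `rank_questions`, and all sample data, are hypothetical.

```python
import re
from collections import Counter


def vote_names(retrieved, top_k=2):
    """Pick candidate names by similarity-weighted voting (an assumption).

    retrieved: list of (similarity, [names]) pairs for visually similar images.
    """
    scores = Counter()
    for similarity, names in retrieved:
        # Deduplicate names per image, preserving order, so one image votes once.
        for name in dict.fromkeys(n.lower() for n in names):
            scores[name] += similarity
    return [name for name, _ in scores.most_common(top_k)]


def tokens(text):
    """Lowercased alphanumeric word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_questions(question, names, archive):
    """Rank archived questions by Jaccard overlap with the question plus a name.

    Jaccard overlap is a stand-in here; the paper matches syntactic trees.
    """
    scored = []
    for candidate in archive:
        cand_tokens = tokens(candidate)
        best = 0.0
        for name in names:
            query_tokens = tokens(question) | tokens(name)
            union = query_tokens | cand_tokens
            overlap = len(query_tokens & cand_tokens) / len(union) if union else 0.0
            best = max(best, overlap)
        scored.append((best, candidate))
    # Stable sort: ties keep their archive order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored]


if __name__ == "__main__":
    # Hypothetical search results for a photo of an unnamed landmark.
    retrieved = [
        (0.9, ["Eiffel Tower", "Paris"]),
        (0.8, ["Eiffel Tower"]),
        (0.4, ["Tokyo Tower"]),
    ]
    archive = [
        "How tall is the Eiffel Tower?",
        "When was the Eiffel Tower built?",
        "How do I bake sourdough bread?",
    ]
    names = vote_names(retrieved)
    for ranked_question in rank_questions("When was this built?", names, archive):
        print(ranked_question)
```

Keeping several candidate names instead of only the top one means a wrong first guess from the visual stage can still be recovered when a lower-ranked name matches an archived question well.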