Adaptive Learning for Multimodal Fusion in Video Search

  • Authors:
  • Wen-Yu Lee; Po-Tun Wu; Winston Hsu

  • Affiliations:
  • National Taiwan University, Taiwan; National Taiwan University, Taiwan; National Taiwan University, Taiwan

  • Venue:
  • PCM '09 Proceedings of the 10th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing
  • Year:
  • 2009

Abstract

Multimodal fusion has proven effective in video search, given the sheer volume of video data. State-of-the-art methods address the problem with query-dependent fusion, where modality weights vary across query classes (e.g., objects, sports, scenes, people). However, given the training queries, most prior methods rely on manually pre-defined query classes, ad-hoc query classification, and heuristically determined fusion weights, which suffer from accuracy issues and do not scale to large data. Unlike prior methods, we propose an adaptive query learning framework for multimodal fusion. For each new query, we adopt ListNet to adaptively learn the fusion weights from its semantically related training queries, selected dynamically by the k-nearest-neighbor method. ListNet is efficient and directly optimizes search ranking performance rather than classification accuracy. In general, the proposed method has the following advantages: 1) no pre-defined query classes are needed; 2) the multimodal fusion weights are learned automatically and adaptively, without ad-hoc hand-tuning; 3) the training examples are selected according to query semantics, requiring no noisy query classification. Experimenting on the large-scale TRECVID video benchmark, we show that the proposed method is scalable and competitive with prior query-dependent methods.
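
The workflow outlined in the abstract (select semantically related training queries by k-nearest neighbors, then learn per-query fusion weights with ListNet's listwise loss) can be illustrated with a minimal NumPy sketch. All names (knn_related_queries, adaptive_fusion_weights), the Euclidean distance metric, the plain gradient-descent optimizer, and the assumed data layout (per-query semantic feature vectors, per-document unimodal score matrices, relevance labels) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax (ListNet's top-one probability model)."""
    e = np.exp(x - x.max())
    return e / e.sum()

def knn_related_queries(query_feat, train_query_feats, k=5):
    """Pick the k training queries whose (assumed) semantic feature
    vectors lie closest to the new query, by Euclidean distance."""
    dists = np.linalg.norm(train_query_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

def listnet_grad(weights, modality_scores, relevance):
    """Gradient of the ListNet top-one cross-entropy loss with respect
    to the fusion weights, for one training query.

    modality_scores: (n_docs, n_modalities) unimodal search scores
    relevance:       (n_docs,) ground-truth relevance labels
    """
    p_pred = softmax(modality_scores @ weights)   # fused score distribution
    p_true = softmax(relevance.astype(float))     # target distribution
    return modality_scores.T @ (p_pred - p_true)

def adaptive_fusion_weights(query_feat, train_query_feats, train_lists,
                            n_modalities, k=5, lr=0.1, epochs=50):
    """For a new query, learn fusion weights by gradient descent on the
    ListNet loss over its k nearest training queries.

    train_lists[i] = (modality_scores_i, relevance_i) for training query i.
    """
    weights = np.full(n_modalities, 1.0 / n_modalities)  # uniform start
    neighbors = knn_related_queries(query_feat, train_query_feats, k)
    for _ in range(epochs):
        for i in neighbors:
            scores, rel = train_lists[i]
            weights -= lr * listnet_grad(weights, scores, rel)
    return weights
```

At query time, the learned weights would simply be applied as a weighted sum of the unimodal ranking scores for the new query's result list; because the weights are re-learned per query from its nearest training queries, no query-class taxonomy or hand-tuned weighting is needed.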