Unsupervised pattern discovery for multimedia sequences

  • Authors:
  • Shih-Fu Chang; Lexing Xie

  • Affiliations:
  • Columbia University; Columbia University

  • Venue:
  • Doctoral thesis, Columbia University
  • Year:
  • 2005

Abstract

This thesis investigates the problem of discovering patterns in multimedia sequences. The problem is of interest because capturing and storing large amounts of multimedia data has become commonplace, yet our capability to process, interpret, and use these rich corpora has notably lagged behind. Patterns refer to the recurrent and statistically consistent units in a data collection; their recurrence and consistency provide useful bases for organizing large corpora. Unsupervised pattern discovery is important, as it is desirable to adapt to diverse media collections without extensive annotation. Moreover, the patterns should be meaningful, since meanings are what we humans perceive in multimedia. The goal of this thesis is to devise a general framework for finding multi-modal temporal patterns in a collection of multimedia sequences, using the self-similarity in both the appearance and the temporal progression of the content. Within this framework, we address three sub-problems: learning temporal pattern models, associating meanings with patterns, and finding patterns across multiple modalities. We propose novel models for discovering multimedia temporal patterns. We construct dynamic graphical models that capture the multi-level dependency between audio-visual observations and events, and we propose a stochastic search scheme for finding the optimal model size and topology, along with unsupervised feature grouping for selecting relevant descriptors for temporal streams. We present novel approaches to automatically explaining and evaluating the patterns in multimedia streams. These approaches link the computational representations of the patterns, such as those acquired by a dynamic graphical model, with words in the video stream; the linking between audio-visual pattern representations and the metadata is achieved by statistical association. We develop solutions for finding patterns that reside across multiple modalities. This is realized with a layered dynamic mixture model, in which intra-modality temporal dependency and inter-modality asynchrony are handled in different parts of the model structure. With unsupervised pattern discovery, we are able to discover from baseball and soccer programs the common semantic states, play and break, with accuracies comparable to their supervised counterparts. On a large broadcast news corpus, we find that the discovered multimedia patterns correspond well with news topics that have salient audio-visual cues. These findings demonstrate the potential of our framework for mining multi-level temporal patterns from multimodal streams; the framework holds promise for adapting to new content domains and extending to other applications such as event detection and information retrieval.
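
As a minimal sketch of the idea behind dynamic graphical models for temporal patterns, consider fitting a flat two-state Gaussian HMM to per-frame audio-visual features so that each time step receives a recurrent hidden-state label, loosely analogous to play and break. The thesis's models are hierarchical and richer than this; the sketch below uses the hmmlearn library and a random placeholder feature matrix purely for illustration.

```python
# A minimal sketch of unsupervised temporal-state discovery with a flat
# Gaussian HMM; the thesis uses richer hierarchical dynamic graphical
# models, and the feature matrix here is random placeholder data.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Placeholder features: one vector (e.g., color, motion, audio energy)
# per time step; a real pipeline would extract these from video.
features = rng.normal(size=(500, 6))

# Two hidden states, loosely analogous to "play" vs. "break".
model = GaussianHMM(n_components=2, covariance_type="diag",
                    n_iter=50, random_state=0)
model.fit(features)

states = model.predict(features)  # one discrete state label per time step
print("state occupancy:", np.bincount(states) / len(states))
```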
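The search over model size can be illustrated, in much simplified form, by random-walk proposals over the number of hidden states scored with BIC. The thesis's scheme is a full stochastic search over both size and topology; the sketch below only keeps proposals that improve the score, and it reuses `features` and `GaussianHMM` from the sketch above.

```python
# A simplified stand-in for stochastic model-size search: propose growing
# or shrinking the state count at random and keep BIC-improving proposals.
import math
import random

def bic(k, X):
    """Fit a k-state HMM and return its Bayesian information criterion."""
    m = GaussianHMM(n_components=k, covariance_type="diag",
                    n_iter=30, random_state=0).fit(X)
    d = X.shape[1]
    n_params = (k * k - 1) + 2 * k * d  # transition/start probs + means + vars
    return -2.0 * m.score(X) + n_params * math.log(len(X))

random.seed(0)
k = 2
best = bic(k, features)
for _ in range(10):
    k_new = max(1, k + random.choice([-1, 1]))  # local random-walk proposal
    score = bic(k_new, features)
    if score < best:  # keep proposals that improve the criterion
        k, best = k_new, score
print("selected number of states:", k)
```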
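The statistical association between pattern labels and words can be sketched with co-occurrence counts and pointwise mutual information (PMI) over per-segment transcripts. The segment labels and words below are invented placeholders, not data from the thesis.

```python
# A sketch of linking discovered patterns to words by statistical
# association: count pattern/word co-occurrences over segments and rank
# words per pattern by PMI. All segment data here is hypothetical.
import math
from collections import Counter

segments = [
    (0, ["pitch", "strike", "batter"]),
    (0, ["pitch", "swing"]),
    (1, ["replay", "crowd"]),
    (1, ["commercial", "crowd"]),
]

pair_counts, state_counts, word_counts = Counter(), Counter(), Counter()
for state, words in segments:
    state_counts[state] += 1
    for w in set(words):
        word_counts[w] += 1
        pair_counts[(state, w)] += 1

n = len(segments)

def pmi(state, word):
    joint = pair_counts[(state, word)] / n
    return math.log(joint / ((state_counts[state] / n) * (word_counts[word] / n)))

for state in state_counts:
    ranked = sorted(((pmi(state, w), w) for (s, w) in pair_counts if s == state),
                    reverse=True)
    print(state, "->", [w for _, w in ranked[:3]])
```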
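Finally, the layered idea behind multi-modal fusion can be caricatured as two clustering layers: cluster each modality separately, then cluster the aligned per-modality labels into cross-modal patterns. This deliberately ignores the intra-modality temporal dependency and inter-modality asynchrony that the layered dynamic mixture model actually handles, and it uses scikit-learn's KMeans on random placeholder features.

```python
# A much-simplified two-layer stand-in for the layered dynamic mixture
# model: per-modality clustering, then clustering of the joint labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
audio = rng.normal(size=(300, 4))   # placeholder audio features
visual = rng.normal(size=(300, 8))  # placeholder visual features

# Layer 1: cluster each modality on its own.
a_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(audio)
v_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(visual)

# Layer 2: cluster one-hot encodings of the aligned per-modality labels.
joint = np.hstack([np.eye(3)[a_labels], np.eye(3)[v_labels]])
patterns = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(joint)
print("cross-modal pattern counts:", np.bincount(patterns))
```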