Unsupervised object of interest discovery in multi-view video sequence

Authors:
Thanaphat Thummanuntawat;Wuttipong Kumwilaisak;Jatuporn Chinrungrueng
Affiliations:
Communication and Multimedia Laboratory, Electronics and Telecommunication Department, King Mongkut's University of Technology, Thonburi, Bangkok, Thailand;Communication and Multimedia Laboratory, Electronics and Telecommunication Department, King Mongkut's University of Technology, Thonburi, Bangkok, Thailand;National Electronics and Computer Technology, Thailand
Venue:
ICACT'09 Proceedings of the 11th international conference on Advanced Communication Technology - Volume 3
Year:
2009

Abstract

This paper presents a novel algorithm for unsupervised discovery of objects of interest in multi-view video sequences. We first classify a multi-view video sequence by its degree of movement. For a sequence with movement, we group video frames along and across views into a group of pictures (GOP). Key points, i.e. feature vectors representing the textures in the frames of a GOP, are extracted using the Scale-Invariant Feature Transform (SIFT) and clustered using the K-means algorithm. Each key point is assigned a visual word according to its cluster. Patches, which represent small textured areas, are generated using the Maximally Stable Extremal Regions (MSER) operator. A patch can contain more than one key point and therefore more than one visual word, so it is represented by different visual words to different degrees. A motion detection algorithm determines the moving regions in the video frames; patches in these regions are more likely to be parts of the object of interest. Combining the developed spatial and appearance models with motion detection, we compute the likelihood that each patch belongs to the object of interest. Patches with high likelihoods are clustered and identified as the object of interest. When there is no movement or no significant movement, we assume that human subjects are the most important objects in the sequence, and a face detection algorithm is used to locate the object of interest. When the sequence contains no human subjects, the frequencies of the visual words occurring in it are used to identify the object of interest. This is possible because the patches that form parts of the objects of interest can be derived from the visual words.
Experimental results on various types of multi-view video sequences show that the proposed algorithm correctly discovers the objects of interest in over 80% of cases on average.
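
The visual-word step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that descriptors and cluster centroids are already available (for example from SIFT and K-means), and the function names, the per-word weights, and the `motion_boost` factor are hypothetical choices made only for illustration. The abstract does not give the actual likelihood model, so the scoring rule below is a stand-in.

```python
from collections import Counter
from math import dist


def assign_visual_words(descriptors, centroids):
    """Map each descriptor to the index of its nearest centroid (its visual word)."""
    return [
        min(range(len(centroids)), key=lambda k: dist(d, centroids[k]))
        for d in descriptors
    ]


def patch_word_histogram(patch_keypoint_ids, words, vocab_size):
    """Normalized visual-word histogram of the key points inside one patch."""
    counts = Counter(words[i] for i in patch_keypoint_ids)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * vocab_size
    return [counts[w] / total for w in range(vocab_size)]


def patch_score(histogram, word_weights, in_motion_region, motion_boost=2.0):
    """Hypothetical score: an appearance term, boosted for patches in moving regions."""
    appearance = sum(h * w for h, w in zip(histogram, word_weights))
    return appearance * (motion_boost if in_motion_region else 1.0)


# Toy data: four 2-D "descriptors" and a two-word vocabulary (assumed values).
descriptors = [(0.0, 0.1), (0.9, 1.0), (1.0, 0.9), (0.1, 0.0)]
centroids = [(0.0, 0.0), (1.0, 1.0)]
words = assign_visual_words(descriptors, centroids)

# Patch "a" contains key points 0, 1, 2; patch "b" contains key point 3.
hist_a = patch_word_histogram([0, 1, 2], words, vocab_size=2)
hist_b = patch_word_histogram([3], words, vocab_size=2)

# Assumed per-word weights; patch "a" lies in a moving region, patch "b" does not.
word_weights = [0.2, 0.8]
score_a = patch_score(hist_a, word_weights, in_motion_region=True)
score_b = patch_score(hist_b, word_weights, in_motion_region=False)
```

Because a patch may hold several key points, its histogram spreads over several visual words, which corresponds to the statement that a patch is represented by different visual words to different degrees. In this toy setup the patch in the moving region receives the higher score.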