We propose a new approach to recognizing objects and scenes in news videos, motivated by the availability of large video collections. The approach treats recognition as the translation of visual elements into words. Correspondences between visual elements and words are learned using methods adapted from statistical machine translation and are then used to predict words for particular image regions (region naming), to predict words for entire images (auto-annotation), or to associate automatically generated speech transcript text with the correct video frames (video alignment). Experimental results are presented on the TRECVID 2004 data set, which consists of about 150 hours of news video with manual annotations and speech transcript text. The results show that retrieval performance can be improved by associating visual and textual elements. An extensive analysis of features is also provided, and a method for combining features is proposed.
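
The translation view can be illustrated with a minimal sketch. The abstract does not say which translation model is used, so the code below assumes an IBM Model 1 style EM estimate of t(word | visual token); this is an illustrative choice, not necessarily the authors' method. The visual tokens (`blob_blue`, etc.), the toy image/word pairs, and the function names `train_model1` and `annotate` are all hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(word | visual_token) by EM, in the style of IBM Model 1.

    `pairs` is a list of (visual_tokens, words) tuples, one per image.
    Assumption: the visual tokens are already quantised region labels.
    """
    words = {w for _, ws in pairs for w in ws}
    # Uniform initialisation over the word vocabulary.
    t = defaultdict(lambda: 1.0 / len(words))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (word, token) counts
        total = defaultdict(float)  # expected counts per token
        for tokens, ws in pairs:
            for w in ws:
                # E-step: share each word across the tokens of its image.
                norm = sum(t[(w, v)] for v in tokens)
                for v in tokens:
                    frac = t[(w, v)] / norm
                    count[(w, v)] += frac
                    total[v] += frac
        # M-step: renormalise so the words for each token sum to one.
        t = defaultdict(
            float, {(w, v): c / total[v] for (w, v), c in count.items()}
        )
    return t


def annotate(t, tokens, vocabulary, top_k=1):
    """Rank words for an image by summing t(word | token) over its tokens."""
    scores = {w: sum(t[(w, v)] for v in tokens) for w in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy corpus: three images, each with two region tokens
# and two annotation words.
pairs = [
    (["blob_blue", "blob_green"], ["sky", "grass"]),
    (["blob_blue", "blob_grey"], ["sky", "building"]),
    (["blob_green", "blob_grey"], ["grass", "building"]),
]
vocabulary = ["sky", "grass", "building"]
t = train_model1(pairs)
print(annotate(t, ["blob_blue"], vocabulary))
```

With a single token as input, `annotate` corresponds to region naming; with all tokens of an image, it corresponds to auto-annotation. Video alignment would additionally need a temporal window over transcript words, which this sketch does not model.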