Generating web-based corpora for video transcripts categorization

Authors:
José M. Perea-Ortega;Arturo Montejo-Ráez;M. Teresa Martín-Valdivia;L. Alfonso Ureña-López
Affiliations:
SINAI Research Group, Computer Science Department, University of Jaén, 23071 Jaén, Spain;SINAI Research Group, Computer Science Department, University of Jaén, 23071 Jaén, Spain;SINAI Research Group, Computer Science Department, University of Jaén, 23071 Jaén, Spain;SINAI Research Group, Computer Science Department, University of Jaén, 23071 Jaén, Spain
Venue:
Expert Systems with Applications: An International Journal
Year:
2013

Abstract

This paper proposes using the Internet as a rich source of information for generating learning corpora for video transcript categorization systems. Our main goal has been to study the behavior of different learning corpora generated from the Internet and to analyze some of their features. Specifically, Wikipedia, Google and the blogosphere were used to generate these corpora, with the VideoCLEF 2008 track serving as the evaluation framework for the experiments. Based on this evaluation framework, we conclude that the proposed approach is a promising strategy for classifying videos from their transcripts. The differing sizes of the generated corpora might suggest that larger corpora yield better results, but we show that corpus size is not always a reliable indicator of how a learning corpus will behave. The results show that integrating knowledge from the blogosphere or Google produces more reliable corpora for this task than those based on Wikipedia.