Proceedings of the twelfth international workshop on Web information and data management
This paper addresses the problem of automatically learning title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Earlier work pursued similar objectives but considered only titles in text format. We propose a general learning schema that learns textual titles from style information and image-format titles from image properties. We construct features from automatically annotated pages harvested from the Web, and we detail the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied and achieve good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
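As an illustration of the textual-title case, the following minimal sketch shows how style features might be built from automatically annotated pages. It is not the feature set or annotation procedure of the paper: the tag groups, the feature names (`in_heading`, `in_emphasis`, `depth`, `n_words`), the helper `extract_rows`, and the labelling rule (a text node counts as a title candidate when its text occurs inside the page's `<title>` element) are all assumptions made for the example, and the labelling rule in particular is deliberately crude.

```python
from html.parser import HTMLParser

# Hypothetical tag groups; the paper's actual style features are not listed in the abstract.
HEADING_TAGS = {"h1", "h2", "h3"}
EMPHASIS_TAGS = {"b", "strong", "em"}
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class StyleFeatureParser(HTMLParser):
    """Collects one feature row per visible text node, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.title = ""   # content of <title>, used as a weak label source
        self.rows = []    # one dict per text node outside <title>

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        open_tags = set(self.stack)
        if not text or SKIP_TAGS & open_tags:
            return
        if "title" in open_tags:
            self.title = (self.title + " " + text).strip()
            return
        self.rows.append({
            "text": text,
            "in_heading": int(bool(HEADING_TAGS & open_tags)),
            "in_emphasis": int(bool(EMPHASIS_TAGS & open_tags)),
            "depth": len(self.stack),
            "n_words": len(text.split()),
        })


def extract_rows(html):
    """Return labelled feature rows for one page (assumed labelling rule)."""
    parser = StyleFeatureParser()
    parser.feed(html)
    parser.close()
    title = parser.title.lower()
    for row in parser.rows:
        # Assumption: a node is a positive example if its text appears in <title>.
        row["label"] = int(bool(title) and row["text"].lower() in title)
    return parser.rows
```

Rows produced this way, minus the `text` field, could then be passed as training examples to any decision tree or random forest implementation.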