A General Learning Method for Automatic Title Extraction from HTML Pages

Authors:
Sahar Changuel;Nicolas Labroche;Bernadette Bouchon-Meunier
Affiliations:
Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016;Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016;Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016
Venue:
MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
Year:
2009



Abstract

This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have pursued similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that learns textual titles from style information and image-format titles from image properties. We construct features from automatically annotated pages harvested from the Web; the paper details the corpus creation method as well as the information extraction techniques. Learning algorithms such as Decision Trees and Random Forests are applied to these features and achieve good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
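
The abstract says textual title candidates are described by style information, but it does not list the features themselves. The following is a minimal sketch of what such a feature extraction step could look like, using only Python's standard `html.parser`. The feature names (`heading_level`, `is_emphasized`, `n_words`, `position`) and the tag sets are assumptions chosen for illustration, not the authors' feature set.

```python
from html.parser import HTMLParser

# Hypothetical style cues; the abstract does not specify the real features.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
EMPHASIS_TAGS = {"b", "strong", "em", "i"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIPPED_TAGS = {"script", "style", "title"}  # text that is not rendered in the body


class StyleFeatureExtractor(HTMLParser):
    """Collect one feature dict per non-empty text node of a page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.candidates = []  # one feature dict per candidate text node

    def handle_starttag(self, tag, attrs):
        # Void elements never get a closing tag, so they are not pushed.
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or SKIPPED_TAGS & set(self.open_tags):
            return
        headings = [t for t in self.open_tags if t in HEADING_TAGS]
        self.candidates.append({
            "text": text,
            # 0 means "not inside a heading"; otherwise 1..6 for h1..h6.
            "heading_level": int(headings[-1][1]) if headings else 0,
            "is_emphasized": bool(EMPHASIS_TAGS & set(self.open_tags)),
            "n_words": len(text.split()),
            "position": len(self.candidates),  # order of appearance
        })


def extract_candidates(html):
    """Return the list of candidate text nodes with their style features."""
    parser = StyleFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

Each resulting dict could then be turned into a numeric vector and passed to a decision tree or random forest classifier that labels a candidate as title or non-title. That training step is left out here because the abstract gives no detail on labels or hyperparameters.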