Automatic Web Pages Author Extraction

  • Authors:
  • Sahar Changuel;Nicolas Labroche;Bernadette Bouchon-Meunier

  • Affiliations:
  • Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016;Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016;Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016

  • Venue:
  • FQAS '09 Proceedings of the 8th International Conference on Flexible Query Answering Systems
  • Year:
  • 2009

Abstract

This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, as a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach to the problem, using the C4.5 decision tree algorithm. The particularity of our approach is that it draws on both structural and contextual information. A semi-automatic approach to corpus expansion was used to annotate the dataset with less human effort. We show that our method achieves good results (more than 80% in terms of F1-measure) despite the heterogeneity of our corpus.
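
To make the two technical ingredients named in the abstract concrete, here is a minimal Python sketch of the gain ratio (the split criterion that C4.5 uses to choose attributes) and of the F1-measure. It is an illustration only, not the authors' implementation: the abstract does not list the actual features, so the feature names `in_meta_tag` and `near_keyword`, the label names, and the toy data are all assumptions made up for this example.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of a categorical feature: information gain / split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


def f1_measure(predicted, actual, positive="author"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one entry per candidate text node of a Web page.
labels = ["author", "other", "other", "author", "other", "other"]
in_meta_tag = [True, False, False, True, False, False]    # structural cue (assumed)
near_keyword = [True, True, False, False, False, False]   # contextual cue (assumed)

structural_score = gain_ratio(in_meta_tag, labels)
contextual_score = gain_ratio(near_keyword, labels)

# Hypothetical classifier output, scored against the toy labels.
predicted = ["author", "author", "other", "other", "other", "other"]
score = f1_measure(predicted, labels)
```

On this toy data the structural feature separates the two classes perfectly, so a C4.5-style learner would split on it first. The real corpus, feature set, and tree are described in the paper itself, not here.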