Automatic Web Pages Author Extraction

Authors:
Sahar Changuel;Nicolas Labroche;Bernadette Bouchon-Meunier
Affiliations:
Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016;Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016;Laboratoire d'Informatique de Paris 6 (LIP6), DAPA, LIP6, Paris, France 75016
Venue:
FQAS '09 Proceedings of the 8th International Conference on Flexible Query Answering Systems
Year:
2009



Abstract

This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach based on the C4.5 decision tree algorithm. The particularity of our approach is that it exploits both structural and contextual information. To annotate the dataset with less human effort, we expanded the corpus with a semi-automatic procedure. Our method achieves good results (an F1-measure above 80%) despite the heterogeneity of the corpus.
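
As a rough illustration of how a C4.5-style learner could weigh structural against contextual evidence, the sketch below computes the gain ratio (the split criterion C4.5 uses) for two features of candidate text nodes. The feature names `tag` and `follows_by_keyword`, the label `is_author`, and the six toy rows are assumptions invented for this example; the abstract does not list the features or data actually used.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, feature, target="is_author"):
    """C4.5 split criterion: information gain divided by split information."""
    labels = [row[target] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    # Expected label entropy remaining after splitting on the feature.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises features that have many distinct values.
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def best_feature(rows, features):
    """Return the feature with the highest gain ratio."""
    return max(features, key=lambda f: gain_ratio(rows, f))


# Hypothetical candidates: "tag" stands in for structural information (the
# enclosing HTML element) and "follows_by_keyword" for contextual information
# (the text comes right after the word "by"). Both are illustrative only.
CANDIDATES = [
    {"tag": "meta", "follows_by_keyword": True, "is_author": True},
    {"tag": "span", "follows_by_keyword": True, "is_author": True},
    {"tag": "span", "follows_by_keyword": False, "is_author": False},
    {"tag": "div", "follows_by_keyword": False, "is_author": False},
    {"tag": "div", "follows_by_keyword": False, "is_author": False},
    {"tag": "meta", "follows_by_keyword": False, "is_author": True},
]

if __name__ == "__main__":
    for name in ("tag", "follows_by_keyword"):
        print(name, round(gain_ratio(CANDIDATES, name), 3))
    print("root split:", best_feature(CANDIDATES, ["tag", "follows_by_keyword"]))
```

On this toy data the three-valued `tag` feature has the larger raw information gain, but its higher split information lowers its gain ratio below that of the binary contextual feature. That normalisation is what distinguishes C4.5 from a plain information-gain criterion.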