Learning to Extract Content from News Webpages

Authors:
Alex Spengler; Patrick Gallinari
Venue:
WAINA '09 Proceedings of the 2009 International Conference on Advanced Information Networking and Applications Workshops
Year:
2009

Cited by:
Document structure meets page layout: loopy random fields for web news content extraction. Proceedings of the 10th ACM symposium on Document engineering.
A comparison of discriminative classifiers for web news content extraction. RIAO '10 Adaptivity, Personalization and Fusion of Heterogeneous Information.
An efficient language-independent method to extract content from news webpages. Proceedings of the 11th ACM symposium on Document engineering.

Abstract

We consider the problem of content extraction from online news webpages. To explore to what extent the syntactic markup and the visual structure of a webpage facilitate the extraction of its content, we compare two state-of-the-art classifiers as first instantiations of a general framework that allows for proper model comparison. To this end, we introduce the publicly available NEWS600 corpus, a set of 604 real-world news webpages which have been annotated with 30 semantic labels. An empirical analysis of the two models on this dataset shows that the inclusion of structural information is indeed advantageous.