An efficient language-independent method to extract content from news webpages

  • Authors:
  • Eduardo Cardoso;Iam Jabour;Eduardo Laber;Rogério Rodrigues;Pedro Cardoso

  • Affiliations:
  • PUC-Rio, Rio de Janeiro, Brazil;PUC-Rio, Rio de Janeiro, Brazil;PUC-Rio, Rio de Janeiro, Brazil;Microsoft Corporation, Rio de Janeiro, Brazil;PUC-Rio, Rio de Janeiro, Brazil

  • Venue:
  • Proceedings of the 11th ACM symposium on Document engineering
  • Year:
  • 2011

Abstract

We tackle the task of news webpage segmentation, specifically identifying the news title, publication date, and story body. While the literature reports very good results, most of them rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where processing speed is essential. Our approach extends our previous work in the area. It combines machine learning algorithms with structural properties and hints of visual presentation styles, which are computed with a method quicker than regular rendering. In our experiments, we paid special attention to aspects that are often overlooked in the literature, such as processing time and the generalization of extraction results to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full-rendering alternative while retaining good extraction quality.
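
The abstract does not list the features or the learning algorithm used, so the following is only a minimal sketch of the general idea: collecting structural properties and cheap style hints in one pass over the markup, without rendering the page. Every name here (`BlockFeatureExtractor`, `extract_blocks`, the tag sets, and the feature names) is a hypothetical choice made for illustration and is not taken from the paper. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Hypothetical tag sets; the paper's actual choices are not given in the abstract.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "div", "li", "time"}
HEADING_TAGS = {"h1", "h2", "h3"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockFeatureExtractor(HTMLParser):
    """Collect per-block features in a single pass, with no rendering step."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements, as feature dicts
        self.blocks = []     # closed block-level elements that contain text
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").lower()
        self.stack.append({
            "tag": tag,
            "depth": len(self.stack),  # structural property
            "text_len": 0,
            "link_len": 0,
            # Style hints read directly from the markup (assumed features).
            "is_heading": tag in HEADING_TAGS,
            "bold_hint": "font-weight" in style and "bold" in style,
        })
        if tag == "a":
            self.link_depth += 1

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        if self.stack and self.stack[-1]["tag"] in {"script", "style"}:
            return  # ignore non-visible text
        # Credit the text to the nearest enclosing block-level element.
        for node in reversed(self.stack):
            if node["tag"] in BLOCK_TAGS:
                node["text_len"] += n
                if self.link_depth > 0:
                    node["link_len"] += n
                break

    def handle_endtag(self, tag):
        if tag not in [node["tag"] for node in self.stack]:
            return  # stray closing tag
        # Pop through the matching element to tolerate unclosed children.
        while self.stack:
            node = self.stack.pop()
            self._finish(node)
            if node["tag"] == tag:
                break

    def _finish(self, node):
        if node["tag"] == "a":
            self.link_depth -= 1
        if node["tag"] in BLOCK_TAGS and node["text_len"] > 0:
            node["link_density"] = node["link_len"] / node["text_len"]
            self.blocks.append(node)


def extract_blocks(html):
    """Return one feature dict per text-bearing block, in closing order."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    sample = (
        "<html><body><h1>Big headline</h1>"
        '<div><p>Story text here. <a href="#">more</a></p>'
        '<p><a href="/x">Home</a></p></div></body></html>'
    )
    for block in extract_blocks(sample):
        print(block["tag"], block["depth"], block["text_len"],
              round(block["link_density"], 2))
```

In a pipeline of the kind the abstract describes, feature dictionaries like these would be passed to a trained classifier that labels each block as title, publication date, story body, or other. The classifier is left out here because the abstract does not say which one is used.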