A fast and simple method for extracting relevant content from news webpages

  • Authors:
  • Eduardo Sany Laber; Críston Pereira de Souza; Iam Vita Jabour; Evelin Carvalho Freire de Amorim; Eduardo Teixeira Cardoso; Raúl Pierre Rentería; Lúcio Cunha Tinoco; Caio Dias Valentim

  • Affiliations:
  • PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; FAST, a Microsoft Subsidiary, Rio de Janeiro, Brazil; FAST, a Microsoft Subsidiary, Rio de Janeiro, Brazil; FAST, a Microsoft Subsidiary, Rio de Janeiro, Brazil

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Abstract

We propose NCE, an efficient algorithm for identifying and extracting relevant content from news webpages. We define relevant content as the textual sections that most objectively describe the main event in the article: the title and the main body, excluding comments about the story and presentation elements. Our experiments suggest that NCE is competitive, in terms of extraction quality, with the best methods available in the literature: it achieves F1 = 90.7% on our test corpus of 324 news webpages from 22 sites. The main advantages of our method are its simplicity and its computational performance. It is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
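
For readers unfamiliar with the F1 measure quoted above, the following is a minimal sketch of how extraction quality might be scored for a single page. It is not the NCE algorithm, and the abstract does not state how the authors matched extracted text against the reference; the whitespace tokenization, the bag-of-words overlap, and the function name `extraction_f1` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple


def extraction_f1(extracted: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one page.

    Assumption: both texts are compared as bags of whitespace-separated
    tokens. The paper may use a different matching unit.
    """
    ext_tokens = Counter(extracted.split())
    ref_tokens = Counter(reference.split())

    # Multiset intersection: each token counts as many times as it
    # appears in both the extracted and the reference text.
    overlap = sum((ext_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0

    precision = overlap / sum(ext_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: the extractor kept the headline and body
    # but also pulled in one navigation word ("Home").
    reference = "Storm hits coast Thousands lose power"
    extracted = "Home Storm hits coast Thousands lose power"
    p, r, f1 = extraction_f1(extracted, reference)
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so an extractor is penalized both for keeping presentation elements (lower precision) and for dropping parts of the title or body (lower recall).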