A fast and simple method for extracting relevant content from news webpages

Authors:
Eduardo Sany Laber; Críston Pereira de Souza; Iam Vita Jabour; Evelin Carvalho Freire de Amorim; Eduardo Teixeira Cardoso; Raúl Pierre Rentería; Lúcio Cunha Tinoco; Caio Dias Valentim
Affiliations:
PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; PUC-Rio, Rio de Janeiro, Brazil; FAST, a Microsoft Subsidiary, Rio de Janeiro, Brazil; FAST, a Microsoft Subsidiary, Rio de Janeiro, Brazil; FAST, a Microsoft Subsidiary, Rio de Janeiro, Brazil
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009


Abstract

We propose NCE, an efficient algorithm to identify and extract relevant content from news webpages. We define relevant content as the textual sections that most objectively describe the main event in the article. This includes the title and the main body section, and excludes reader comments about the story and presentation elements. Our experiments suggest that NCE is competitive, in terms of extraction quality, with the best methods available in the literature. It achieves F1 = 90.7% on our test corpus of 324 news webpages from 22 sites. The main advantages of our method are its simplicity and its computational performance: it is at least an order of magnitude faster than methods that use visual features, which makes it well suited to applications that process a large number of pages.
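The abstract reports extraction quality as an F1 score but does not say how the score is computed. As an illustration only, the sketch below assumes a token-level, bag-of-words comparison between the extracted text and a hand-labeled reference. The function name `token_f1` and the whitespace tokenization are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between extracted text and a labeled reference.

    Assumption: tokens are whitespace-separated words and are compared
    as multisets (bag of words). The paper may use a different unit.
    """
    ext = Counter(extracted.split())
    ref = Counter(reference.split())

    # Multiset intersection: tokens present in both, with multiplicity.
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(ext.values())  # share of extracted tokens that are relevant
    recall = overlap / sum(ref.values())     # share of relevant tokens that were extracted
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an extractor that returns the whole article plus one extra token of page chrome still has perfect recall but loses precision, so its F1 falls below 1.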