An efficient language-independent method to extract content from news webpages

Authors:
Eduardo Cardoso;Iam Jabour;Eduardo Laber;Rogério Rodrigues;Pedro Cardoso
Affiliations:
PUC-Rio, Rio de Janeiro, Brazil;PUC-Rio, Rio de Janeiro, Brazil;PUC-Rio, Rio de Janeiro, Brazil;Microsoft Corporation, Rio de Janeiro, Brazil;PUC-Rio, Rio de Janeiro, Brazil
Venue:
Proceedings of the 11th ACM symposium on Document engineering
Year:
2011

Abstract

We tackle the task of news webpage segmentation, specifically identifying the news title, publication date and story body. While the literature reports very good results, most approaches rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where performance is a must. The chosen approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a quicker method than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to some aspects that are often overlooked in the literature, such as processing time and the generalization of the extraction results to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full-rendering alternative while retaining good extraction quality.