We tackle the task of news webpage segmentation, specifically identifying the news title, publication date, and story body. While the literature reports very good results, most approaches rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where processing speed is essential. Our approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a method quicker than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to aspects that are often overlooked in the literature, such as processing time and how well the extraction results generalize to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full-rendering alternative while retaining good extraction quality.