Article clipper: a system for web article extraction

Authors:
Jian Fan;Ping Luo;Suk Hwan Lim;Sam Liu;Parag Joshi;Jerry Liu
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA, USA;Hewlett-Packard Labs, Beijing, China;Hewlett-Packard Labs, Palo Alto, CA, USA;Hewlett-Packard Labs, Palo Alto, CA, USA;Hewlett-Packard Labs, Palo Alto, CA, USA;Hewlett-Packard Labs, Palo Alto, CA, USA
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Citing (5):
Automatic caption localization for photographs on World Wide Web pages
    Information Processing and Management: an International Journal
Title extraction from bodies of HTML documents and its application to web page retrieval
    Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Extracting article text from the web with maximum subsequence segmentation
    Proceedings of the 18th international conference on World wide web
Can we learn a template-independent wrapper for news article extraction from a single training site?
    Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Web article extraction for web printing: a DOM+visual based approach
    Proceedings of the 9th ACM symposium on Document engineering

Cited by (2):
Harnessing the wisdom of the crowds for accurate web page clipping
    Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity
    Proceedings of the twelfth international workshop on Web information and data management

Abstract

Many people use the Web as their main source of information in daily life. However, most web pages contain non-informative components such as sidebars, footers, headers, and advertisements, which are undesirable for applications such as printing. We demonstrate a system that automatically extracts the informative content from news- and blog-like web pages. In contrast to many existing methods, which identify only the text or its bounding rectangular region, our system identifies both the content and the structural roles of its components, such as title, paragraphs, images, and captions. This structural information enables the content to be laid out again in a pleasing way. Besides article text extraction, our system includes the following components: 1) print-link detection, which identifies the URL of the page's print link and uses it for more reliable analysis and recognition; 2) title detection incorporating both visual cues and HTML tags; 3) image and caption detection utilizing extensive visual cues; 4) multiple-page and next-page URL detection. The system has been thoroughly evaluated on a human-labeled ground-truth dataset of 2000 web pages from 100 major web sites, and we show accurate results on this dataset.
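
The abstract does not say how print-link or next-page URL detection is implemented. As a rough illustration of what such components might look for, the sketch below scans anchor tags for simple textual hints. It is not the authors' method: the keyword rules, the `LinkCollector` class, and the `find_candidate_links` function are all hypothetical, and it uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, anchor text, rel) for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._rel = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            self._href = attributes.get("href")
            self._rel = attributes.get("rel") or ""
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text, self._rel))
            self._href = None


def find_candidate_links(html, base_url):
    """Return the first print-like and next-page-like link, as absolute URLs.

    Assumed heuristics (illustrative only): a print link mentions "print" in
    its anchor text or URL; a next-page link has rel="next" or anchor text
    starting with "next".
    """
    parser = LinkCollector()
    parser.feed(html)
    parser.close()

    result = {"print": None, "next": None}
    for href, text, rel in parser.links:
        label = text.lower()
        url = urljoin(base_url, href)
        if result["print"] is None and ("print" in label or "print" in href.lower()):
            result["print"] = url
        elif result["next"] is None and (rel.lower() == "next" or label.startswith("next")):
            result["next"] = url
    return result


if __name__ == "__main__":
    page = (
        '<a href="/story?print=1">Print</a>'
        '<a href="/story?page=2" rel="next">Next page</a>'
        '<a href="/about">About</a>'
    )
    print(find_candidate_links(page, "http://example.com/story"))
```

Keyword matching of this kind is brittle (for example, a URL containing "sprint" would match), which suggests why a system like the one described would combine such hints with visual cues and further analysis.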