Automatic selection of print-worthy content for enhanced web page printing experience

Authors:
Suk Hwan Lim;Liwei Zheng;Jianming Jin;Huiman Hou;Jian Fan;Jerry Liu
Affiliations:
Hewlett-Packard Laboratories, Palo Alto, CA, USA;Hewlett-Packard Laboratories, Beijing, China;Hewlett-Packard Laboratories, Beijing, China;China HP Co Ltd, Beijing, China;Hewlett-Packard Laboratories, Palo Alto, CA, USA;Hewlett-Packard Laboratories, Palo Alto, CA, USA
Venue:
Proceedings of the 10th ACM symposium on Document engineering
Year:
2010

Abstract

Printing web pages often gives a poor user experience. Web pages typically contain content that is neither print-worthy nor informative, such as sidebars, footers, headers, advertisements, and auxiliary information for further browsing. Because including such content degrades the web printing experience, we have developed a tool that first selects the main part of the web page automatically and then lets users make adjustments. In this paper, we describe the algorithm that selects the main content automatically in the first pass. The web page is first segmented into several coherent areas, or blocks, using our web page segmentation method, which clusters content based on affinity values between basic elements. Relative importance values for the segmented blocks are then computed from various features, and the main content is extracted subject to two constraints: it must form a single DOM (Document Object Model) sub-tree, and it must have high importance scores. We evaluated the algorithm on 65 web pages and computed accuracy from the area of overlap between the ground truth and the extracted result.
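
The last two steps of the abstract (choosing one DOM sub-tree with high importance, then scoring the result by overlap area) can be illustrated with a minimal Python sketch. Nothing here is taken from the paper itself: the `Node` class, the area-weighted gain with a 0.5 threshold, and the use of intersection over union as the overlap measure are all assumptions made for illustration, and the importance scores are supplied by hand rather than computed from features.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical DOM node; leaves stand for segmented blocks."""
    name: str
    area: float = 0.0        # rendered area of the block (leaves only)
    importance: float = 0.0  # assumed importance score in [0, 1] (leaves only)
    children: List["Node"] = field(default_factory=list)


def subtree_gain(node: Node, threshold: float) -> float:
    """Sum of area * (importance - threshold) over all leaf blocks in the sub-tree.

    This objective is an assumption: important blocks add to the gain,
    unimportant ones subtract from it.
    """
    if not node.children:
        return node.area * (node.importance - threshold)
    return sum(subtree_gain(child, threshold) for child in node.children)


def select_main_subtree(root: Node, threshold: float = 0.5) -> Node:
    """Return the root of the single sub-tree with the highest gain."""
    best, best_gain = root, subtree_gain(root, threshold)
    stack = list(root.children)
    while stack:
        node = stack.pop()
        gain = subtree_gain(node, threshold)
        if gain > best_gain:
            best, best_gain = node, gain
        stack.extend(node.children)
    return best


Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def overlap_ratio(truth: Rect, result: Rect) -> float:
    """Intersection area over union area of two axis-aligned rectangles (assumed metric)."""
    width = min(truth[2], result[2]) - max(truth[0], result[0])
    height = min(truth[3], result[3]) - max(truth[1], result[1])
    inter = max(width, 0.0) * max(height, 0.0)
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    area_result = (result[2] - result[0]) * (result[3] - result[1])
    union = area_truth + area_result - inter
    return inter / union if union > 0 else 0.0


# Toy page: the title and article are important, everything else is not.
page = Node("body", children=[
    Node("header", area=100, importance=0.1),
    Node("main", children=[
        Node("title", area=50, importance=0.8),
        Node("article", area=600, importance=0.9),
        Node("related", area=100, importance=0.4),
    ]),
    Node("sidebar", area=200, importance=0.2),
    Node("footer", area=80, importance=0.1),
])

selected = select_main_subtree(page)
print(selected.name)
print(overlap_ratio((0, 0, 10, 10), (0, 0, 10, 5)))
```

In this toy page the single sub-tree constraint matters: the title and the article cannot be selected together without also taking their low-importance sibling, so the sketch settles on their common parent instead of the article alone.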