Automatic selection of print-worthy content for enhanced web page printing experience

  • Authors:
  • Suk Hwan Lim;Liwei Zheng;Jianming Jin;Huiman Hou;Jian Fan;Jerry Liu

  • Affiliations:
  • Hewlett-Packard Laboratories, Palo Alto, CA, USA;Hewlett-Packard Laboratories, Beijing, China;Hewlett-Packard Laboratories, Beijing, China;China HP Co Ltd, Beijing, China;Hewlett-Packard Laboratories, Palo Alto, CA, USA;Hewlett-Packard Laboratories, Palo Alto, CA, USA

  • Venue:
  • Proceedings of the 10th ACM Symposium on Document Engineering
  • Year:
  • 2010

Abstract

Printing web pages often gives a poor user experience. Web pages typically contain content that is not print-worthy or informative, such as sidebars, footers, headers, advertisements, and auxiliary information for further browsing. Because including such content degrades the web printing experience, we have developed a tool that first selects the main part of the web page automatically and then allows users to make adjustments. In this paper, we describe the algorithm that selects the main content automatically during the first pass. The web page is first segmented into several coherent areas, or blocks, using our web page segmentation method, which clusters content based on the affinity values between basic elements. Relative importance values for the segmented blocks are then computed from various features, and the main content is extracted under the constraint that it forms a single DOM (Document Object Model) sub-tree with high importance scores. We evaluated our algorithm on 65 web pages and computed the accuracy based on the area of overlap between the ground truth and the result extracted by the algorithm.
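
The abstract does not give the scoring rule or the exact accuracy formula, so the following Python sketch is only an illustration of the two ideas it names: choosing a single DOM sub-tree whose blocks have high importance, and measuring accuracy by area of overlap. Everything in it is an assumption made for illustration, not the authors' method: the `Node` structure, the `threshold` parameter, the area-weighted score, the use of intersection-over-union as the overlap measure, and the toy page with its importance values and areas.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A DOM node; leaves stand in for segmented blocks (hypothetical structure)."""
    name: str
    importance: float = 0.0  # assumed block importance in [0, 1]; used for leaves only
    area: float = 0.0        # assumed rendered area in pixels; used for leaves only
    children: List["Node"] = field(default_factory=list)


def select_main_subtree(root: Node, threshold: float = 0.5) -> Node:
    """Return the node whose sub-tree maximizes the sum of
    (importance - threshold) * area over its leaf blocks.

    Blocks above the threshold add to the score and blocks below it subtract,
    so a sub-tree is penalized for pulling in low-importance content.
    This scoring rule is an assumption, not taken from the paper.
    """
    best_node, best_score = root, float("-inf")

    def visit(node: Node) -> float:
        nonlocal best_node, best_score
        if node.children:
            score = sum(visit(child) for child in node.children)
        else:
            score = (node.importance - threshold) * node.area
        # Children are scored before their parent, so the strict comparison
        # keeps the deeper node when a wrapper has the same score.
        if score > best_score:
            best_node, best_score = node, score
        return score

    visit(root)
    return best_node


Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def overlap_ratio(a: Rect, b: Rect) -> float:
    """Intersection-over-union of two axis-aligned rectangles (assumed metric)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(width, 0.0) * max(height, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Toy page with made-up importance values and areas.
page = Node("body", children=[
    Node("header", importance=0.1, area=8000),
    Node("main", children=[
        Node("article", children=[
            Node("title", importance=0.9, area=3000),
            Node("text", importance=0.8, area=40000),
        ]),
        Node("related-links", importance=0.3, area=5000),
    ]),
    Node("sidebar", importance=0.2, area=12000),
    Node("footer", importance=0.1, area=6000),
])

print(select_main_subtree(page).name)
print(overlap_ratio((0, 0, 10, 10), (0, 0, 10, 5)))
```

On this toy page the sketch selects the `article` node, because extending the selection to `main` or `body` would add blocks whose importance falls below the assumed threshold. A real system would take block importance from the segmentation and feature stage described in the abstract rather than from hand-assigned values.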