Learning block importance models for web pages
Proceedings of the 13th international conference on World Wide Web
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Computing block importance for searching on web sites
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
PrintMarmoset: redesigning the print button for sustainability
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Extracting article text from the web with maximum subsequence segmentation
Proceedings of the 18th international conference on World wide web
Can we learn a template-independent wrapper for news article extraction from a single training site?
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Web article extraction for web printing: a DOM+visual based approach
Proceedings of the 9th ACM symposium on Document engineering
Web page cleaning for web mining through feature weighting
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Harnessing the wisdom of the crowds for accurate web page clipping
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Hi-index | 0.00 |
The user experience of printing web pages has not been very good. Web pages typically contain contents that are not print-worthy or informative such as side bars, footers, headers, advertisements, and auxiliary information for further browsing. Since the inclusion of such contents degrades the web printing experience, we have developed a tool that first selects the main part of the web page automatically and then allows users to make adjustments. In this paper, we describe the algorithm for selecting the main content automatically during the first pass. The web page is first segmented into several coherent areas or blocks using our web page segmentation method that clusters content based on the affinity values between basic elements. The relative importance values for the segmented blocks are computed using various features and the main content is extracted based on the constraint of one DOM (Document Object Model) sub-tree and high important scores. We evaluated our algorithm on 65 web pages and computed the accuracy based on area of overlap between the ground truth and the extracted result of the algorithm.