Printing Web pages from browsers usually results in unsatisfactory printouts because the pages are typically ill-formatted and contain non-informative content such as navigation menus and ads. For this reason, print-worthy Web pages such as articles generally contain hyperlinks (or links) that lead to print-friendly pages containing the salient content. For a more desirable Web printing experience, the main Web content should be extracted to produce well-formatted pages. This paper describes a cloud service based on automatic content extraction and repurposing from print-friendly pages for Web printing. Content extraction from print-friendly pages is simpler and more reliable than from the original pages, but the many variations of print-link representations in HTML make robust print-link detection more difficult than it first appears. First, the link can be text-based, image-based, or both. For example, there is a lexicon of phrases used to indicate print-friendly pages, such as "print", "print article", and "print-friendly version". In addition, some links use printer-resembling image icons, with or without a print phrase present. To complicate matters further, not all of the links contain a valid URL; in some cases the print-friendly page is generated dynamically, either by client-side JavaScript or by the server, so no URL is present. Experimental results suggest that our solution achieves over 99% precision and 97% recall for print-friendly link extraction.
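The link variations described above (text-based, image-based, and links without a usable URL) can be illustrated with a minimal sketch. It is not the detector described in this paper; it is a simple heuristic written under stated assumptions. The lexicon holds only the three phrases quoted in the abstract, the rule that an icon "resembles a printer" when its `src` or `alt` mentions "print" is an assumption, and the names `PrintLinkFinder` and `find_print_links` are hypothetical.

```python
from html.parser import HTMLParser
import re

# Assumed lexicon: only the three phrases quoted in the abstract.
PRINT_PHRASES = ("print", "print article", "print-friendly version")
# Assumed icon heuristic: the image's src or alt text mentions "print".
ICON_HINT = re.compile(r"print", re.IGNORECASE)


class PrintLinkFinder(HTMLParser):
    """Collects <a> elements that look like print-friendly links."""

    def __init__(self):
        super().__init__()
        self.candidates = []    # one dict per candidate link
        self._in_link = False   # True while inside an <a> element
        self._href = None       # href of the currently open <a>
        self._text = []         # text fragments seen inside the <a>
        self._has_icon = False  # True if a print-like <img> was seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []
            self._has_icon = False
        elif tag == "img" and self._in_link:
            hint = (attrs.get("src") or "") + " " + (attrs.get("alt") or "")
            if ICON_HINT.search(hint):
                self._has_icon = True

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or not self._in_link:
            return
        # Normalize whitespace and case before the lexicon lookup.
        text = " ".join("".join(self._text).split()).lower()
        has_phrase = text in PRINT_PHRASES
        if has_phrase or self._has_icon:
            href = (self._href or "").strip()
            # Treat empty, fragment-only, and javascript: hrefs as "no URL".
            no_url = (href == "" or href.startswith("#")
                      or href.lower().startswith("javascript:"))
            self.candidates.append({
                "text": text,
                "href": self._href,
                "text_based": has_phrase,
                "image_based": self._has_icon,
                "has_url": not no_url,
            })
        self._in_link = False


def find_print_links(html):
    """Return candidate print-friendly links found in an HTML string."""
    finder = PrintLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

Candidates with `has_url` set to `False` correspond to the dynamically generated case, where a real system would have to execute the script or query the server instead of following a link. Exact matching against a short lexicon is deliberately strict; it would miss variants such as "print this page", so the lexicon would need to be extended for practical use.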