Turn the page: automated traversal of paginated websites

Authors:
Tim Furche;Giovanni Grasso;Andrey Kravchenko;Christian Schallhart
Affiliations:
Department of Computer Science, Oxford University, Oxford, UK;Department of Computer Science, Oxford University, Oxford, UK;Department of Computer Science, Oxford University, Oxford, UK;Department of Computer Science, Oxford University, Oxford, UK
Venue:
ICWE'12 Proceedings of the 12th international conference on Web Engineering
Year:
2012

Abstract

Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Thus, human users and automatic tools alike have to traverse the pagination links when they crawl the site, extract data, or automate common tasks, where these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text), falling back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only fractions of the individual pages of a given site, rendering a highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages. We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with a novel framework for web block classification, BERyL, that combines rule-based reasoning for feature extraction and machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings and can be adjusted to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
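
To make the "link text only" baseline mentioned in the abstract concrete, the following is a minimal sketch of such a heuristic. It is not the paper's method and not part of BERyL; the keyword list, the patterns, and the function name `looks_like_pagination` are assumptions chosen purely for illustration.

```python
import re

# Hypothetical keyword pattern for "next page" style links (an assumption,
# not taken from the paper).
NEXT_PATTERN = re.compile(r"^\s*(next|more|older|>|>>|»|›)\s*$", re.IGNORECASE)

# Links whose text is a bare number, as in "1 2 3 ..." page lists.
NUMERIC_PATTERN = re.compile(r"^\s*\d+\s*$")


def looks_like_pagination(link_text: str) -> bool:
    """Guess whether a link is a pagination link from its text alone."""
    return bool(NEXT_PATTERN.match(link_text) or NUMERIC_PATTERN.match(link_text))


# The heuristic accepts typical pagination labels ...
print(looks_like_pagination("Next"))   # True
print(looks_like_pagination("2"))      # True
# ... and rejects ordinary navigation links.
print(looks_like_pagination("Home"))   # False
# A bare "5" is also accepted, even if on the page it is a
# "5 stars" rating filter rather than a link to page 5.
print(looks_like_pagination("5"))      # True
```

The last call shows the kind of case where a text-only rule cannot tell a page number from another numeric link, which is why the abstract argues for also using structural and visual features.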