Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Human users and automatic tools alike must therefore traverse the pagination links when they crawl a site, extract data, or automate common tasks that require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text) and fall back to an exhaustive exploration of the site where those heuristics fail. Focused crawlers and data extraction systems in particular target only a fraction of the individual pages of a given site, making highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages. We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with BERyL, a novel framework for web block classification that combines rule-based reasoning for feature extraction with machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings and can be adjusted to maximise precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
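The combination described above, rule-based feature extraction feeding a learned classifier, can be sketched in miniature. This is an illustrative toy only, not BERyL's actual implementation: the feature names, the hand-picked weights, and the linear scoring stand-in for a trained classifier are all hypothetical, and BERyL additionally uses structural and visual features not modelled here.

```python
# Toy sketch of the two-stage idea: (1) rule-based feature extraction over
# anchor elements, (2) a classifier deciding "pagination link or not".
# All feature names and weights below are invented for illustration.
from html.parser import HTMLParser

class LinkFeatureExtractor(HTMLParser):
    """Stage 1: rule-based extraction of per-anchor content features."""
    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._text = []
        self.links = []  # list of (link_text, feature_dict)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor = False
            text = "".join(self._text).strip()
            features = {
                "is_numeric": text.isdigit(),  # page numbers: "1", "2", ...
                "is_nav_word": text.lower() in {"next", "prev", "previous", ">", ">>"},
                "is_short": len(text) <= 3,
            }
            self.links.append((text, features))

def is_pagination_link(features):
    """Stage 2: stand-in for the learned classifier -- a linear score
    with made-up weights in place of trained model parameters."""
    weights = {"is_numeric": 0.6, "is_nav_word": 0.8, "is_short": 0.3}
    score = sum(w for name, w in weights.items() if features[name])
    return score >= 0.5

html = '<a href="/p/2">2</a> <a href="/p/3">next</a> <a href="/about">About us</a>'
parser = LinkFeatureExtractor()
parser.feed(html)
results = {text: is_pagination_link(f) for text, f in parser.links}
# "2" and "next" are classified as pagination links; "About us" is not.
```

In the real framework the extraction rules are declarative and the classifier's feature weights are learned from labelled blocks rather than fixed by hand, which is what allows tuning toward precision, recall, or speed.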