Application of structured document parsing to focused web crawling

Authors:
Ahmed Patel; Nikita Schmidt
Venue:
Computer Standards & Interfaces
Year:
2011

Abstract

The performance of a focused, or topic-specific, Web robot can be improved by taking into account the structure of the documents it downloads. In HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot can use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve its speed of convergence to a topic. Clearly separating the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.
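
The idea of scoring links from the text of selected HTML elements, with the parser kept separate from the scheduler, can be illustrated with a minimal Python sketch. It is not the interface or the scoring method described in the paper. The class and function names, the choice of the title and anchor elements, the two weights, and the word-counting relevance measure are all assumptions made for illustration.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed weights, chosen only for illustration.
ANCHOR_WEIGHT = 2.0   # text inside the <a> element that carries the link
TITLE_WEIGHT = 1.0    # text of the <title> of the page containing the link


def relevance(text, topic_terms):
    """Count the words in `text` that belong to the topic vocabulary."""
    return sum(1 for word in text.lower().split() if word in topic_terms)


class StructureAwareParser(HTMLParser):
    """Collects the page title and the anchor text of every link."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []            # list of (absolute_url, anchor_text)
        self._in_title = False
        self._href = None          # href of the <a> currently open, if any
        self._anchor_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_parts = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href)
            self.links.append((url, " ".join(self._anchor_parts)))
            self._href = None


def score_links(html, base_url, topic_terms):
    """Parser side of the interface: return (url, score) pairs for one page."""
    parser = StructureAwareParser(base_url)
    parser.feed(html)
    parser.close()
    title_score = TITLE_WEIGHT * relevance(parser.title, topic_terms)
    return [
        (url, ANCHOR_WEIGHT * relevance(text, topic_terms) + title_score)
        for url, text in parser.links
    ]


class Scheduler:
    """Scheduler side: a priority queue that hands out the best URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def offer(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated.
            heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None
```

In this sketch the only thing that crosses the boundary between the two components is a list of (url, score) pairs, so either side could be replaced without touching the other. A crawler loop would call `score_links` on each downloaded page, pass every pair to `Scheduler.offer`, and fetch whatever `Scheduler.next_url` returns next.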