A novel focused crawler based on breadcrumb navigation

Authors:
Lizhi Ying; Xinhao Zhou; Jian Yuan; Yongfeng Huang
Affiliations:
Institute of Information Cognition and Intelligence System, Department of Electric Engineering, Tsinghua University, Beijing, China (all four authors)
Venue:
ICSI'12 Proceedings of the Third international conference on Advances in Swarm Intelligence - Volume Part II
Year:
2012

Citing 11
Cited 0

Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web

A vector space model for automatic indexing
Communications of the ACM

Focused Crawling Using Context Graphs
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases

Probabilistic models for focused web crawling
Proceedings of the 6th annual ACM international workshop on Web information and data management

Fast webpage classification using URL features
Proceedings of the 14th ACM international conference on Information and knowledge management

Improving the performance of focused web crawlers
Data & Knowledge Engineering

Ontology-Based Intelligent Web Mining Agent for Taiwan Travel
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 03

OntoCrawler: A focused crawler with ontology-supported website models for information agents
Expert Systems with Applications: An International Journal

Focused crawling using navigational rank
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Where to crawl next for focused crawlers
KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part IV

Highly efficient algorithms for structural clustering of large websites
Proceedings of the 20th international conference on World wide web

Abstract

In this paper, a novel focused crawler based on Breadcrumb Navigation (BN) is proposed. It leverages the breadcrumb navigation found in webpages to reconstruct website structures and address the focused crawling problem. Unlike previous focused crawlers that rely on prediction models, the BN crawler first samples the web to construct a semantic forest for websites based on their breadcrumb navigation, and then searches the forest for the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages belonging to the relevant sub-forest. With this method, the BN crawler spends less time analyzing Webpage-to-Topic (W2T) similarity while still crawling efficiently. Experimental results show that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and can be applied to most websites.
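
The pipeline outlined in the abstract (sample pages, build a forest from breadcrumb trails, select topic-relevant sub-trees, then restrict downloads to them) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify data structures or the relevance measure, so the names `Node`, `build_forest`, `relevant_subtrees`, `urls_under`, and `harvest_ratio` are hypothetical, the breadcrumb trails are assumed to be already extracted from the sampled pages, and a simple keyword match on breadcrumb labels stands in for the paper's W2T similarity. Harvest ratio is computed with its common definition, the fraction of downloaded pages that are relevant.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One breadcrumb label in the forest (hypothetical structure)."""
    label: str
    children: dict = field(default_factory=dict)  # label -> Node
    urls: list = field(default_factory=list)      # sampled pages whose trail ends here


def build_forest(samples):
    """Build a forest from (url, breadcrumb_trail) pairs.

    Each trail is a list of labels ordered from the site root down to the
    page's own category. Trails sharing a first label share a tree.
    """
    roots = {}
    for url, trail in samples:
        if not trail:
            continue  # page without breadcrumbs: nothing to attach
        level, node = roots, None
        for label in trail:
            node = level.setdefault(label, Node(label))
            level = node.children
        node.urls.append(url)
    return roots


def relevant_subtrees(roots, topic_terms):
    """Return the topmost nodes whose label contains a topic term.

    Assumption: keyword overlap replaces the paper's similarity measure.
    """
    topic = {t.lower() for t in topic_terms}
    found, stack = [], list(roots.values())
    while stack:
        node = stack.pop()
        if topic & set(node.label.lower().split()):
            found.append(node)  # the whole sub-tree is treated as relevant
        else:
            stack.extend(node.children.values())
    return found


def urls_under(node):
    """Collect every sampled URL in the sub-tree rooted at node."""
    out = list(node.urls)
    for child in node.children.values():
        out.extend(urls_under(child))
    return out


def harvest_ratio(relevant_downloaded, total_downloaded):
    """Fraction of downloaded pages that are relevant to the topic."""
    return relevant_downloaded / total_downloaded if total_downloaded else 0.0


if __name__ == "__main__":
    samples = [
        ("http://example.com/a", ["Home", "Sports", "Football"]),
        ("http://example.com/b", ["Home", "Sports", "Tennis"]),
        ("http://example.com/c", ["Home", "Finance"]),
    ]
    forest = build_forest(samples)
    hits = relevant_subtrees(forest, ["sports"])
    print(sorted(u for n in hits for u in urls_under(n)))
```

In this sketch the crawl frontier after sampling would contain only pages under the matched sub-trees, which is the sense in which the sampling step saves per-page similarity analysis later on.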