This paper proposes a novel focused crawler based on Breadcrumb Navigation (BN). The crawler uses the breadcrumb navigation trails embedded in webpages to reconstruct website structure and thereby address the focused crawling problem. Unlike previous focused crawlers that rely on prediction models, the BN crawler first samples the web to construct a semantic forest for the websites from their breadcrumb trails, then searches the forest for the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages that belong to the relevant sub-forest. As a result, the BN crawler spends less time analyzing Webpage-to-Topic (W2T) similarity while still crawling efficiently. Experimental results show that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and can be applied to most websites.
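The sample-then-search idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the function names `build_forest`, `relevant_subtrees`, and `collect_urls`, and the word-overlap relevance test are all assumptions made for illustration, since the abstract does not specify how breadcrumb trails are extracted or how topic relevance is scored. The sketch assumes each sampled page has already been reduced to a URL plus its breadcrumb trail (a list of labels from the site root down to the page).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One breadcrumb label in the (hypothetical) semantic forest."""
    label: str
    children: dict = field(default_factory=dict)  # label -> Node
    urls: list = field(default_factory=list)      # sampled pages ending here


def build_forest(samples):
    """Merge (url, breadcrumb_trail) pairs into a forest keyed by root label."""
    roots = {}
    for url, trail in samples:
        if not trail:
            continue  # page exposes no breadcrumb trail; skip it
        level = roots
        node = None
        for label in trail:
            node = level.setdefault(label, Node(label))
            level = node.children
        node.urls.append(url)
    return roots


def relevant_subtrees(roots, topic_terms):
    """Return the topmost nodes whose label shares a word with the topic.

    Word overlap is a stand-in for a real relevance measure (assumption).
    """
    terms = {t.lower() for t in topic_terms}
    hits = []
    stack = list(roots.values())
    while stack:
        node = stack.pop()
        if terms & set(node.label.lower().split()):
            hits.append(node)  # keep the whole sub-tree, do not descend
        else:
            stack.extend(node.children.values())
    return hits


def collect_urls(node):
    """Gather every sampled URL inside a sub-tree."""
    urls = list(node.urls)
    for child in node.children.values():
        urls.extend(collect_urls(child))
    return urls
```

In this sketch, a crawl for the topic "travel" would call `build_forest` on the sampled pages, pass the result to `relevant_subtrees` with `["travel"]`, and then restrict downloading to pages that fall under the returned sub-trees. Stopping at the topmost matching node reflects the abstract's claim that per-page similarity analysis is reduced: relevance is decided once per sub-tree rather than once per page.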