The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment
Journal of the ACM (JACM)
ACM Transactions on Internet Technology (TOIT)
Using Reinforcement Learning to Spider the Web Efficiently
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Learning to Probabilistically Identify Authoritative Documents
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Distributed Hypertext Resource Discovery Through Examples
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Focused Crawling Using Context Graphs
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Focused Crawls, Tunneling, and Digital Libraries
ECDL '02 Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries
Deriving link-context from HTML tag tree
DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Link Contexts in Classifier-Guided Topical Crawlers
IEEE Transactions on Knowledge and Data Engineering
Hi-index | 0.00 |
At present, using focused crawler becomes a way to seek the needed information. The main characteristic of a focused web crawler is to select and retrieve only relevant web pages in each crawling process. In this paper, we propose a learnable algorithm that combines link analysis with web content in order to retrieve specific web documents, and it can predict the next URL through learning. The algorithm also uses an adaptive tunneling to overcome some of the limitations of normal focused crawlers. We apply three metrics to compare its efficiency with other well-known web crawling techniques based.