The web is a huge and highly dynamic environment that is growing exponentially in content and evolving rapidly in structure. No search engine can cover the whole web, so each must focus its crawling on the most valuable pages. An efficient crawling algorithm for retrieving the most important pages therefore remains a challenging problem. Several algorithms, such as PageRank and OPIC, have been proposed, but they have high time complexity and low throughput. In this paper we propose FICA, an intelligent crawling algorithm based on reinforcement learning that models a random surfing user. The crawling priority of a page is based on a concept we call logarithmic distance. FICA is easy to implement, and its time complexity is O(E*logV), where V and E are the number of nodes and edges in the web graph, respectively. A comparison with other proposed algorithms shows that FICA outperforms them in discovering highly important pages. Furthermore, FICA computes the importance (ranking) of each page during the crawling process, so it can also be used as a ranking method for computing page importance. A useful property of FICA is its adaptability: it adjusts dynamically to changes in the web graph. We evaluated our approach on the UK web graph.
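To make the idea of distance-prioritized crawling concrete, the following is a minimal sketch, not the FICA algorithm itself. The abstract does not define logarithmic distance, so the sketch assumes that following a link from a page with k out-links costs log10(k), and that a page's distance is the smallest accumulated cost from the seed. The reinforcement-learning update is omitted, and the names `crawl_order`, `graph`, and `seed` are hypothetical. With a binary heap, each edge causes at most one push, which is consistent with the O(E*logV) bound stated above.

```python
import heapq
import math
from typing import Dict, List


def crawl_order(graph: Dict[str, List[str]], seed: str) -> List[str]:
    """Return pages in the order a distance-prioritized crawler would fetch them.

    `graph` maps each page to the pages it links to (a stand-in for fetching
    and parsing the page). The link cost log10(out-degree) is an assumption.
    """
    dist = {seed: 0.0}          # best known distance for each discovered page
    heap = [(0.0, seed)]        # crawl frontier, smallest distance first
    fetched = set()
    order = []
    while heap:
        d, page = heapq.heappop(heap)
        if page in fetched:
            continue            # stale frontier entry, page already crawled
        fetched.add(page)
        order.append(page)
        links = graph.get(page, [])
        if not links:
            continue
        step = math.log10(len(links))   # assumed cost of any link on this page
        for target in links:
            candidate = d + step
            if candidate < dist.get(target, math.inf):
                dist[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return order
```

In this sketch, pages reached through a page with many out-links receive a larger distance and are crawled later than pages reached through a short chain of selective links.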