FICA: A novel intelligent crawling algorithm based on reinforcement learning

Authors:
Ali Mohammad Zareh Bidoki;Nasser Yazdani;Pedram Ghodsnia
Affiliations:
School of Electrical and Computer Engineering, University College of Engineering, University of Tehran, Tehran, Iran
Corresponding author e-mail: zare_b@ece.ut.ac.ir; phone: +98-21-66946927; fax: +98-21-8497642
Venue:
Web Intelligence and Agent Systems
Year:
2009

Abstract

The web is a huge, highly dynamic environment that is growing exponentially in content and changing rapidly in structure. No search engine can cover the whole web, so a crawler must focus on the most valuable pages. An efficient crawling algorithm for retrieving the most important pages therefore remains a challenging problem. Several algorithms, such as PageRank and OPIC, have been proposed, but they have high time complexity and low throughput. In this paper we propose FICA, an intelligent crawling algorithm based on reinforcement learning that models a random surfing user. Pages are prioritized for crawling by a concept we call logarithmic distance. FICA is easy to implement, and its time complexity is O(E*logV), where V and E are the numbers of nodes and edges in the web graph, respectively. A comparison with other proposed algorithms shows that FICA outperforms them in discovering highly important pages. Furthermore, FICA computes the importance (ranking) of each page during the crawling process, so it can also be used as a ranking method for page importance. FICA also adapts to the web: it adjusts dynamically as the web graph changes. We evaluated our approach on the UK web graph.
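
The idea of prioritizing pages by a distance measure can be illustrated with a small sketch. The abstract does not define logarithmic distance or the reinforcement-learning update, so the code below rests on labeled assumptions: the cost of following a link is taken to be the base-10 logarithm of the source page's out-degree, and a plain Dijkstra-style priority queue stands in for the learning rule. The function name `crawl_order` and the dictionary-based toy graph are hypothetical, and this is not presented as the authors' implementation.

```python
import heapq
import math


def crawl_order(graph, seeds):
    """Return (order, dist) for a distance-prioritized crawl of a toy graph.

    graph: dict mapping a page to the list of pages it links to
           (a stand-in for fetching a page and extracting its links).
    seeds: iterable of start pages, each given distance 0.
    """
    dist = {s: 0.0 for s in seeds}
    heap = [(0.0, s) for s in sorted(dist)]
    heapq.heapify(heap)
    order = []
    crawled = set()

    while heap:
        d, page = heapq.heappop(heap)
        if page in crawled:
            continue  # stale queue entry; page was already crawled
        crawled.add(page)
        order.append(page)

        links = graph.get(page, [])
        if not links:
            continue
        # ASSUMPTION (not stated in the abstract): each link out of `page`
        # costs log10 of the page's out-degree.
        step = math.log10(len(links))
        for target in links:
            new_d = d + step
            if target not in crawled and new_d < dist.get(target, math.inf):
                dist[target] = new_d
                heapq.heappush(heap, (new_d, target))

    return order, dist


if __name__ == "__main__":
    toy = {"s": ["a", "b"], "a": ["c"], "b": ["c", "d", "e", "f"]}
    order, dist = crawl_order(toy, ["s"])
    print(order)
    print({page: round(d, 3) for page, d in dist.items()})
```

Each edge triggers at most one heap push, so this sketch runs in O(E*logV) time on a graph with V pages and E links, which is consistent with the bound quoted in the abstract. Under the assumed cost, pages reached through pages with few outgoing links receive smaller distances and are crawled earlier.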