A novel crawling algorithm for web pages

Authors:
Mohammad Amin Golshani;Vali Derhami;AliMohammad ZarehBidoki
Affiliations:
Department of Electrical and Computer Engineering, Yazd University, Yazd, Iran;Department of Electrical and Computer Engineering, Yazd University, Yazd, Iran;Department of Electrical and Computer Engineering, Yazd University, Yazd, Iran
Venue:
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Year:
2011

Abstract

The crawler is a main component of a search engine: it is responsible for discovering and downloading web pages. No search engine can cover the whole web, so the crawler has to focus on the most valuable pages. Several crawling algorithms, such as PageRank, OPIC, and FICA, have been proposed, but they have low throughput. To overcome this problem, we propose a new crawling algorithm called FICA+, which is easy to implement. In FICA+, the importance of a page is determined from the logarithmic distance and the weight of its incoming links. To evaluate FICA+, we use the web graph of the University of California, Berkeley. Experimental results show that our algorithm outperforms other crawling algorithms in discovering highly important pages.
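
The abstract does not give the FICA+ scoring formula, so the following is only an illustrative sketch of distance-prioritised crawling in general, not the authors' method. It assumes that following a link out of a page costs log10 of that page's out-degree, and it leaves out the incoming-link weight entirely. The function name `crawl_order` and the toy graph are hypothetical.

```python
import heapq
import math


def crawl_order(graph, seed):
    """Return pages in the order a distance-prioritised crawler would fetch them.

    graph: dict mapping a page to the list of pages it links to
           (a hypothetical in-memory stand-in for the web).
    Assumption (not from the abstract): a link out of page p costs
    log10(out_degree(p)); pages with the smallest total cost go first.
    """
    best = {seed: 0.0}          # smallest known distance per discovered page
    frontier = [(0.0, seed)]    # min-heap of (distance, page)
    visited = set()
    order = []
    while frontier:
        dist, page = heapq.heappop(frontier)
        if page in visited:
            continue            # stale heap entry
        visited.add(page)
        order.append(page)      # "download" the page
        links = graph.get(page, [])
        if not links:
            continue
        step = math.log10(len(links))
        for target in links:
            new_dist = dist + step
            if target not in visited and new_dist < best.get(target, math.inf):
                best[target] = new_dist
                heapq.heappush(frontier, (new_dist, target))
    return order


# Toy graph: "hub" has ten out-links, "leaf" has one.
toy_graph = {
    "seed": ["hub", "leaf"],
    "hub": [f"p{i}" for i in range(10)],
    "leaf": ["x"],
}
order = crawl_order(toy_graph, "seed")
```

In this toy graph the single link out of `leaf` costs nothing, so `x` is fetched before any of the ten pages reached through `hub`. This shows how a logarithmic distance spreads a page's influence across its out-links.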