Crawling a country: better strategies than breadth-first for web page ordering

Authors:
Ricardo Baeza-Yates;Carlos Castillo;Mauricio Marin;Andrea Rodriguez
Affiliations:
Universidad de Chile;Universidad de Chile;Universidad de Magallanes;Universidad de Concepcion
Venue:
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Year:
2005

Abstract

This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial PageRank calculations.
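
The idea of replaying different ordering strategies over one fixed Web graph can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy graph, the hand-assigned importance scores, and the `most_inlinks_first` strategy are hypothetical and are not the strategies or metrics evaluated in the article. The sketch only shows how a simulator lets two orderings be compared under identical conditions.

```python
# Toy crawl simulator: both strategies run on the same fixed graph,
# so any difference in results comes from the ordering alone.
# Graph, scores, and the second strategy are illustrative assumptions.

TOY_GRAPH = {
    "home": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["d"],
    "hub": ["d"],
    "d": [],
}

# Hypothetical importance scores (a real study would compute these).
TOY_IMPORTANCE = {"home": 3, "a": 1, "b": 1, "c": 1, "hub": 3, "d": 1}


def simulate_crawl(graph, seeds, pick_next):
    """Return the order in which pages are downloaded."""
    frontier = list(seeds)                # discovered, not yet downloaded
    seen = set(seeds)
    inlinks_seen = {p: 0 for p in seeds}  # links found so far to each page
    order = []
    while frontier:
        page = pick_next(frontier, inlinks_seen)
        frontier.remove(page)
        order.append(page)
        for target in graph.get(page, []):
            inlinks_seen[target] = inlinks_seen.get(target, 0) + 1
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


def breadth_first(frontier, inlinks_seen):
    """Download pages in the order they were discovered."""
    return frontier[0]


def most_inlinks_first(frontier, inlinks_seen):
    """Download the page with the most links seen so far (ties: oldest)."""
    return max(frontier, key=lambda p: inlinks_seen[p])


def cumulative_importance(order, importance):
    """Fraction of total importance collected after each download."""
    total = sum(importance.values())
    collected, fractions = 0, []
    for page in order:
        collected += importance[page]
        fractions.append(collected / total)
    return fractions


if __name__ == "__main__":
    for strategy in (breadth_first, most_inlinks_first):
        order = simulate_crawl(TOY_GRAPH, ["home"], strategy)
        curve = cumulative_importance(order, TOY_IMPORTANCE)
        print(strategy.__name__, order, [round(f, 2) for f in curve])
```

A strategy whose curve rises faster collects more importance early in the crawl, which is the sense of "early" used in the abstract.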