Optimal crawling strategies for web search engines

Authors:
J. L. Wolf;M. S. Squillante;P. S. Yu;J. Sethuraman;L. Ozsen
Affiliations:
IBM Watson Research Center, Yorktown Heights, NY;IBM Watson Research Center, Yorktown Heights, NY;IBM Watson Research Center, Yorktown Heights, NY;Columbia University, New York, NY;Northwestern University, Evanston, IL
Venue:
Proceedings of the 11th international conference on World Wide Web
Year:
2002

Abstract

Web search engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: the frequency with which a client makes a search engine query and then clicks on a returned URL only to find that the result is incorrect. The first part of our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems that are highly computationally efficient, a crucial property for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.
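
To make the first part of the scheme more concrete, the following is a minimal sketch of how crawl frequencies might be chosen as a discrete resource allocation problem. It is not the paper's algorithm. It assumes a much simpler model than the general stochastic framework described above: each page is updated by a Poisson process with a known rate, crawls of a page are evenly spaced over one scheduling period, and a fixed total number of crawls is available. Under those assumptions the time-averaged freshness of a page is concave in its number of crawls, so a greedy allocation by marginal gain is optimal. The names `freshness`, `allocate_crawls`, `rates`, `weights`, and `budget` are hypothetical; the weights stand in for whatever importance measure is of interest (for example, how often a page is clicked, as a rough proxy for the embarrassment metric).

```python
import heapq
import math


def freshness(rate, crawls, period=1.0):
    """Time-averaged probability that the local copy of a page is fresh.

    Assumes Poisson updates with the given rate and `crawls` evenly
    spaced crawls during one period (an illustrative simplification).
    """
    if rate == 0:
        return 1.0  # a page that never changes is assumed never stale
    if crawls == 0:
        return 0.0  # long-run limit: an uncrawled, changing page is stale
    r = rate * period / crawls  # expected updates between two crawls
    return -math.expm1(-r) / r  # (1 - e^{-r}) / r


def allocate_crawls(rates, weights, budget, period=1.0):
    """Greedily assign `budget` crawls to pages by largest marginal gain."""
    if not rates:
        return []
    counts = [0] * len(rates)

    def gain(i):
        # Weighted freshness improvement from giving page i one more crawl.
        before = freshness(rates[i], counts[i], period)
        after = freshness(rates[i], counts[i] + 1, period)
        return weights[i] * (after - before)

    # Max-heap on marginal gain (heapq is a min-heap, so negate the key).
    heap = [(-gain(i), i) for i in range(len(rates))]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-gain(i), i))
    return counts


if __name__ == "__main__":
    # Hypothetical update rates (updates per period) and equal weights.
    example_rates = [0.0, 1.0, 10.0]
    example_weights = [1.0, 1.0, 1.0]
    print(allocate_crawls(example_rates, example_weights, budget=5))
```

The heap keeps each step at logarithmic cost in the number of pages, which matters at web scale. The second part of the scheme, which turns frequencies and ideal crawl times into an achievable schedule via network flow, is not covered by this sketch.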