User-centric Web crawling

Authors:
Sandeep Pandey; Christopher Olston
Affiliations:
Carnegie Mellon University, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA
Venue:
WWW '05 Proceedings of the 14th international conference on World Wide Web
Year:
2005

Abstract

Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web.

In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
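The abstract does not state the paper's metric or priority formula, so the following is only a minimal sketch of what usage-driven refresh scheduling could look like in general. Every name here is hypothetical: `query_weight` stands in for how much user attention a page receives in search results, `staleness` for an estimated probability that the stored copy is out of date, and their product is an assumed priority, not the authors' definition.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    query_weight: float  # hypothetical: share of user attention the page gets in results
    staleness: float     # hypothetical: estimated probability the stored copy is outdated


def refresh_priority(page: Page) -> float:
    # Assumed priority: expected benefit to users of re-downloading this page.
    # This product is an illustrative stand-in, not the formula from the paper.
    return page.query_weight * page.staleness


def schedule_refresh(pages, budget):
    # Pick the `budget` pages with the highest assumed priority, best first.
    return heapq.nlargest(budget, pages, key=refresh_priority)


pages = [
    Page("a.example/news", 0.50, 0.9),
    Page("b.example/about", 0.30, 0.1),
    Page("c.example/blog", 0.05, 0.8),
    Page("d.example/faq", 0.15, 0.5),
]
chosen = schedule_refresh(pages, budget=2)
print([p.url for p in chosen])
```

In this toy data the frequently changing but rarely viewed blog page loses out to the FAQ page, which changes less often but matters more to users. That is the intuition behind weighting refreshes by search engine usage rather than by change rate alone.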