Recrawl scheduling based on information longevity

Authors:
Christopher Olston; Sandeep Pandey
Affiliations:
Yahoo! Research, Santa Clara, CA, USA; Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Abstract

It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time. In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.
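The distinction between ephemeral and persistent content can be illustrated with a small sketch. This is not the paper's generative model or its scheduling policy; it is only a toy illustration, assuming a page is represented as a list of text fragments per crawl snapshot and that a fragment counts as persistent once it survives a hypothetical threshold of `min_run` consecutive snapshots. The function names and the threshold are assumptions made for this example.

```python
from collections import defaultdict


def fragment_lifetimes(snapshots):
    """Map each fragment to its longest run of consecutive snapshots.

    `snapshots` is a chronological list of fragment lists (assumed format).
    """
    longest = defaultdict(int)   # best run seen so far per fragment
    current = defaultdict(int)   # run that is still ongoing per fragment
    for snap in snapshots:
        present = set(snap)
        # A fragment missing from this snapshot ends its ongoing run.
        for frag in list(current):
            if frag not in present:
                del current[frag]
        for frag in present:
            current[frag] += 1
            longest[frag] = max(longest[frag], current[frag])
    return dict(longest)


def split_by_longevity(snapshots, min_run=2):
    """Split fragments into (persistent, ephemeral) sets.

    `min_run` is a hypothetical threshold, not a value from the paper.
    """
    lifetimes = fragment_lifetimes(snapshots)
    persistent = {f for f, n in lifetimes.items() if n >= min_run}
    ephemeral = set(lifetimes) - persistent
    return persistent, ephemeral


# Toy page: blog posts accumulate, the quote of the day changes every crawl.
snapshots = [
    ["post A", "quote 1"],
    ["post A", "post B", "quote 2"],
    ["post A", "post B", "post C", "quote 3"],
]
persistent, ephemeral = split_by_longevity(snapshots)
print(sorted(persistent))  # ['post A', 'post B']
print(sorted(ephemeral))   # ['post C', 'quote 1', 'quote 2', 'quote 3']
```

Note that "post C" lands in the ephemeral set only because it has been observed once; a fragment's longevity cannot be judged from a single snapshot, which is one reason a scheduling policy would need a model of longevity rather than a fixed cutoff like this one.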