It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., a quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time. In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that agrees with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost than previous approaches.
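The distinction between ephemeral and persistent content can be illustrated with a minimal sketch. This is not the measurement procedure or the scheduling policy from the paper; it only assumes that each crawl of a page yields a set of text fragments, and it counts how many consecutive crawls each fragment survives. The function names, the `min_lifetime` threshold, and the sample data are all hypothetical.

```python
from typing import Dict, List, Set


def fragment_lifetimes(snapshots: List[Set[str]]) -> Dict[str, int]:
    """For each fragment, return its longest run of consecutive snapshots."""
    longest: Dict[str, int] = {}
    current: Dict[str, int] = {}
    for fragments in snapshots:
        # Extend the run of every fragment still present; a fragment that
        # is absent from this snapshot is dropped, so its run restarts.
        current = {f: current.get(f, 0) + 1 for f in fragments}
        for f, run in current.items():
            longest[f] = max(longest.get(f, 0), run)
    return longest


def persistent_fragments(snapshots: List[Set[str]], min_lifetime: int = 2) -> Set[str]:
    """Fragments that survive at least min_lifetime consecutive snapshots."""
    lifetimes = fragment_lifetimes(snapshots)
    return {f for f, n in lifetimes.items() if n >= min_lifetime}


# Hypothetical page: blog posts persist, the quote of the day changes each crawl.
snapshots = [
    {"post: hello world", "quote: carpe diem"},
    {"post: hello world", "quote: tempus fugit"},
    {"post: hello world", "post: second entry", "quote: alea iacta est"},
    {"post: second entry", "quote: veni vidi vici"},
]

print(sorted(persistent_fragments(snapshots)))
```

In this toy data the blog posts qualify as persistent while each quote appears in only one snapshot. A longevity-aware recrawl policy could, under these assumptions, discount changes that touch only short-lived fragments.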