How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the web, including all the popular search engines, but few studies have been performed to date to answer them.

One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over four months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily.

This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages.

After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones.

This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages.
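
The measurements mentioned in the abstract (a per-page checksum, a word-based comparison between successive downloads, and a pseudo-random 0.1% sample of URLs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and three choices are assumptions made for illustration: MD5 as the checksum (the abstract attributes MD5 only to Cho and Garcia-Molina's study), Jaccard similarity over word sets as a stand-in for the unspecified feature vector, and hash-based bucketing as the sampling rule.

```python
import hashlib


def page_checksum(body: bytes) -> str:
    """Return the MD5 hex digest of the raw page bytes (assumed checksum)."""
    return hashlib.md5(body).hexdigest()


def has_changed(previous: bytes, current: bytes) -> bool:
    """Binary change test: any byte difference counts as a change."""
    return page_checksum(previous) != page_checksum(current)


def word_overlap(previous: str, current: str) -> float:
    """Jaccard similarity of the two pages' word sets.

    A crude stand-in for comparing feature vectors: 1.0 means the same
    set of words, 0.0 means no words in common.
    """
    a, b = set(previous.split()), set(current.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def in_sample(url: str, per_thousand: int = 1) -> bool:
    """Deterministically select roughly per_thousand/1000 of all URLs.

    Hashing the URL (an assumed rule) gives the same decision on every
    crawl, so a sampled page has its full text saved each week.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 1000
    return bucket < per_thousand


if __name__ == "__main__":
    week1 = "welcome to our site news of the week"
    week2 = "welcome to our site news of the month"
    print(has_changed(week1.encode(), week2.encode()))
    print(round(word_overlap(week1, week2), 2))
    print(in_sample("http://example.com/index.html"))
```

The checksum answers only whether a page changed at all, whereas a word-level comparison gives a graded measure of how much it changed, which is the added sensitivity the abstract refers to.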