Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Mirror, mirror on the Web: a study of host pairs with replicated content
WWW '99 Proceedings of the eighth international conference on World Wide Web
Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Breadth-first crawling yields high-quality pages
Proceedings of the 10th international conference on World Wide Web
The Evolution of the Web and Implications for an Incremental Crawler
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
A large-scale study of the evolution of web pages
WWW '03 Proceedings of the 12th international conference on World Wide Web
Statistical Identification of Encrypted Web Browsing Traffic
SP '02 Proceedings of the 2002 IEEE Symposium on Security and Privacy
Rate of change and other metrics: a live study of the world wide web
USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems
Evolution of web site design patterns
ACM Transactions on Information Systems (TOIS)
BuzzRank … and the trend is your friend
Proceedings of the 15th international conference on World Wide Web
Computer Networks: The International Journal of Computer and Telecommunications Networking
Dynamic test collections: measuring search effectiveness on the live web
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating the evolution of categorized web page populations
ICWE '06 Workshop proceedings of the sixth international conference on Web engineering
An approach to detection ontology changes
ICWE '06 Workshop proceedings of the sixth international conference on Web engineering
Structure and evolution of online social networks
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
The discoverability of the web
Proceedings of the 16th international conference on World Wide Web
Foundations and Trends in Web Science
Using neighbors to date web documents
Proceedings of the 9th annual ACM international workshop on Web information and data management
Longitudinal trends in academic web links
Journal of Information Science
Data & Knowledge Engineering
Microscopic evolution of social networks
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Spamming botnets: signatures and characteristics
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
A Quantitative Evaluation of Dissemination-Time Preservation Metadata
ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries
Characterization of the evolution of a news Web site
Journal of Systems and Software
A three-year study on the freshness of web search engine databases
Journal of Information Science
Sitemaps: above and beyond the crawl of duty
Proceedings of the 18th international conference on World wide web
A Study of the Impact of Index Updates on Distributed Query Processing for Web Search
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Proceedings of the 4th Annual International Conference on Wireless Internet
A method for measuring the evolution of a topic on the Web: The case of “informetrics”
Journal of the American Society for Information Science and Technology
Proceedings of the 18th ACM conference on Information and knowledge management
Computer Networks: The International Journal of Computer and Telecommunications Networking
A capture-recapture sampling standardization for improving Internet meta-search
Computer Standards & Interfaces
Estimating the size and evolution of categorised topics in web directories
Web Intelligence and Agent Systems
Computer Networks: The International Journal of Computer and Telecommunications Networking
SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval
Coverage and timeliness analysis of search engines with webpage monitoring results
WISE'07 Proceedings of the 8th international conference on Web information systems engineering
Graph structure of the Korea web
DASFAA'07 Proceedings of the 12th international conference on Database systems for advanced applications
News page discovery policy for instant crawlers
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
A Framework for Large-Scale Detection of Web Site Defacements
ACM Transactions on Internet Technology (TOIT)
Term frequency dynamics in collaborative articles
Proceedings of the 10th ACM symposium on Document engineering
A heuristic-based feature selection method for clustering spam emails
ICONIP'10 Proceedings of the 17th international conference on Neural information processing: theory and algorithms - Volume Part I
How is the Semantic Web evolving? A dynamic social network perspective
Computers in Human Behavior
Discovering URLs through user feedback
Proceedings of the 20th ACM international conference on Information and knowledge management
A precise metric for measuring how much web pages change
DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications
Effective criteria for web page changes
APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
World Wide Web
Exploring temporal evidence in web information retrieval
FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
Teaching of web information retrieval: web first or IR first?
TLIR'07 Proceedings of the First international conference on Teaching and Learning of Information Retrieval
Re-wiring activity of malicious networks
PAM'12 Proceedings of the 13th international conference on Passive and Active Measurement
An evaluation of caching policies for memento timemaps
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Archival HTTP redirection retrieval policies
Proceedings of the 22nd international conference on World Wide Web companion
CALA: An unsupervised URL-based web page classification system
Knowledge-Based Systems
How fast does the Web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the Web, including all the popular search engines, but few studies have been performed to date to answer them.

One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720 000 pages on a daily basis over 4 months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all Web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily.

This paper expands on Cho and Garcia-Molina's study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150 836 209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page and a feature vector of the words on the page, plus various other data such as the page length and the HTTP status code. Moreover, we pseudorandomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages.

After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones.

This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages.
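
The checksum-based change test and the 0.1% URL sample mentioned in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not say how the feature vector was built or how URLs were pseudorandomly selected, so `word_overlap` (a plain Jaccard similarity over word sets) and `in_sample` (hashing the URL) are hypothetical stand-ins, and all function names are invented.

```python
import hashlib


def checksum(page: bytes) -> str:
    """Return the MD5 hex digest of the raw page bytes."""
    return hashlib.md5(page).hexdigest()


def has_changed(old_page: bytes, new_page: bytes) -> bool:
    """Binary change test: any difference in the bytes counts as a change."""
    return checksum(old_page) != checksum(new_page)


def word_overlap(old_text: str, new_text: str) -> float:
    """Jaccard similarity of the two word sets, from 0.0 to 1.0.

    Assumption: a simple stand-in for a word-based feature vector,
    used to grade how much a page changed rather than whether it did.
    """
    old_words, new_words = set(old_text.split()), set(new_text.split())
    if not old_words and not new_words:
        return 1.0
    return len(old_words & new_words) / len(old_words | new_words)


def in_sample(url: str, rate_per_thousand: int = 1) -> bool:
    """Deterministically select about rate_per_thousand/1000 of all URLs.

    Assumption: hashing the URL is one way to get a repeatable
    pseudorandom sample, so the same URLs are kept on every crawl.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 1000 < rate_per_thousand
```

A checksum comparison only reports that something changed, which is why a graded measure such as `word_overlap` is useful alongside it: a page whose only difference is a timestamp fails the checksum test but still scores close to 1.0 on word overlap.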