Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques
HICSS '02 Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02)-Volume 3 - Volume 3
The indexable web is more than 11.5 billion pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Agreeing to disagree: search engines and their public interfaces
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Random sampling from a search engine's index
Journal of the ACM (JACM)
Design and selection criteria for a national web archive
ECDL'06 Proceedings of the 10th European conference on Research and Advanced Technology for Digital Libraries
Losing my revolution: how many resources shared on social media have been lost?
TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
An evaluation of caching policies for memento timemaps
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Archival HTTP redirection retrieval policies
Proceedings of the 22nd international conference on World Wide Web companion
Carbon dating the web: estimating the age of web resources
Proceedings of the 22nd international conference on World Wide Web companion
Hi-index | 0.00 |
The Memento Project's archive access additions to HTTP have enabled development of new web archive access user interfaces. After experiencing this web time travel, the in- evitable question that comes to mind is "How much of the Web is archived?" This question is studied by approximating the Web via sampling URIs from DMOZ, Delicious, Bitly, and search engine indexes and measuring number of archive copies available in various public web archives. The results indicate that 35%-90% of URIs have at least one archived copy, 17%-49% have two to five copies, 1%-8% have six to ten copies, and 8%-63% at least ten copies. The number of URI copies varies as a function of time, but only 14.6-31.3% of URIs are archived more than once per month.