Crawlers harvest the web by iteratively downloading the documents referenced by URLs. Different URLs frequently refer to the same document, leading crawlers to download duplicates; as a result, web archives built through incremental crawls waste space storing them. In this paper, we study the prevalence of duplicates within a web archive and discuss strategies to eliminate them at the storage level during the crawl. We present a storage system architecture that addresses the requirements of web archives and detail its implementation and evaluation. The system now supports an archive for the Portuguese web, replacing the previous NFS-based storage servers. Experimental results showed that eliminating duplicates can improve storage throughput: the web storage system outperformed NFS-based storage by 68% on read operations and by 50% on write operations.
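The core idea of eliminating duplicates at the storage level can be illustrated with a minimal content-addressable store: each document is keyed by a digest of its content, so two URLs whose pages are byte-identical share a single stored blob. This is only a sketch of the general technique; the abstract does not specify the hash function or data structures the actual system uses, so the SHA-256 choice and the class below are illustrative assumptions.

```python
import hashlib

class DedupStore:
    """Illustrative content-addressable store: identical documents share one blob.
    (Hypothetical sketch; not the paper's actual implementation.)"""

    def __init__(self):
        self.blobs = {}  # content digest -> document bytes (stored once)
        self.refs = {}   # URL -> content digest

    def put(self, url: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        # Store the content only if this digest has not been seen before.
        self.blobs.setdefault(digest, content)
        self.refs[url] = digest
        return digest

    def get(self, url: str) -> bytes:
        # Resolve the URL to its digest, then fetch the shared blob.
        return self.blobs[self.refs[url]]
```

With this scheme, an incremental crawl that re-downloads an unchanged page (or fetches the same page under an aliased URL) costs only a small reference entry rather than a second copy of the document, which is the space saving the abstract reports.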