Managing duplicates in a web archive

  • Authors:
  • Daniel Gomes; André L. Santos; Mário J. Silva

  • Affiliations:
  • Universidade de Lisboa, Lisboa, Portugal (all authors)

  • Venue:
  • Proceedings of the 2006 ACM Symposium on Applied Computing (SAC '06)
  • Year:
  • 2006

Abstract

Crawlers harvest the web by iteratively downloading the documents referenced by URLs. Different URLs frequently refer to the same document, leading crawlers to download duplicates; web archives built through incremental crawls therefore waste space storing them. In this paper, we study the prevalence of duplicates within a web archive and discuss strategies to eliminate them at the storage level during the crawl. We present a storage system architecture that addresses the requirements of web archives and detail its implementation and evaluation. The system now supports an archive of the Portuguese web, replacing the previous NFS-based storage servers. Experimental results showed that eliminating duplicates can improve storage throughput: the web storage system outperformed NFS-based storage by 68% in read operations and by 50% in write operations.
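
To illustrate the general idea of storage-level duplicate elimination (this is a minimal sketch, not the architecture described in the paper), a store can key each document on a fingerprint of its content, such as a SHA-256 digest, so that identical documents fetched from different URLs share a single stored copy while every URL still resolves. The `DedupStore` class and its file layout below are hypothetical, chosen only for illustration.

```python
import hashlib
from pathlib import Path


class DedupStore:
    """Minimal content-addressable store: identical document bodies
    downloaded from different URLs are written to disk only once."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index: dict[str, str] = {}  # URL -> content digest

    def put(self, url: str, content: bytes) -> str:
        # Fingerprint the document body; equal bytes yield equal digests.
        digest = hashlib.sha256(content).hexdigest()
        blob = self.root / digest
        if not blob.exists():       # store the bytes only once
            blob.write_bytes(content)
        self.index[url] = digest    # map this URL to the shared copy
        return digest

    def get(self, url: str) -> bytes:
        return (self.root / self.index[url]).read_bytes()


store = DedupStore("/tmp/archive")
store.put("http://example.com/a", b"<html>same page</html>")
store.put("http://example.com/b", b"<html>same page</html>")  # no new blob
```

In a real archive the URL-to-digest index would itself have to be persistent and shared across crawl increments and storage nodes; the sketch keeps it in memory only for brevity.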