Harvesting needed to maintain scientific literature online

Authors:
Nikolay Nikolov;Peter Stoehr
Affiliations:
European Bioinformatics Institute, Cambridge, United Kingdom;European Bioinformatics Institute, Cambridge, United Kingdom
Venue:
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Year:
2008

Citing 1
Cited 2

Search engines and their public interfaces: which apis are the most synchronized?

Proceedings of the 16th international conference on World Wide Web

Mashing up life science literature resources

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries

Generating citation digests for scientific publications

Proceedings of the 10th annual joint conference on Digital libraries

Abstract

Millions of scientific articles are freely accessible on the web. While some of them are stored in institutional repositories, many are made available on personal pages, which are exposed to the net's transience. We found that nearly 11% of the URLs of PDF documents containing references to life science publications were no longer accessible within 5 months of being harvested through a search engine (SE) API. For most of them (8.4%), no SE cache backup could be found. Although we have yet to estimate the exact rate at which the scientific literature disappears and the duration of its disappearance, the results so far clearly indicate that web harvesting is needed to preserve the online scientific literature.
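
The two percentages reported above can be illustrated with a minimal sketch of how such rates might be computed from a set of URL recheck records. This is not the authors' code: the `UrlCheck` record, its `accessible` and `cached` fields, and the `link_rot_summary` function are hypothetical names introduced here, and the sketch assumes that both percentages are taken over the full set of harvested URLs. The sample records are toy data, not results from the study.

```python
from dataclasses import dataclass


@dataclass
class UrlCheck:
    """One harvested URL and the outcome of rechecking it later (hypothetical record)."""
    url: str
    accessible: bool  # True if the URL still served the document on recheck
    cached: bool      # True if a search-engine cache copy was found (assumed field)


def link_rot_summary(checks):
    """Return the share of inaccessible URLs and the share with no cache backup.

    Both shares are percentages of all checked URLs (an assumption of this sketch).
    """
    total = len(checks)
    if total == 0:
        raise ValueError("no URL checks supplied")
    lost = [c for c in checks if not c.accessible]
    lost_no_cache = [c for c in lost if not c.cached]
    return {
        "total": total,
        "inaccessible_pct": 100.0 * len(lost) / total,
        "inaccessible_no_cache_pct": 100.0 * len(lost_no_cache) / total,
    }


# Toy data for illustration only: 10 URLs, 2 no longer accessible,
# one of which has no cache copy.
sample = [UrlCheck(f"http://example.org/paper{i}.pdf", True, True) for i in range(8)]
sample.append(UrlCheck("http://example.org/gone1.pdf", False, True))
sample.append(UrlCheck("http://example.org/gone2.pdf", False, False))

summary = link_rot_summary(sample)
print(summary)
```

In practice, the `accessible` flag would come from re-requesting each URL some months after harvesting, and the `cached` flag from querying whichever cache lookup the search engine offers; both steps are left out here because the abstract does not specify them.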