Semi automated metadata extraction for preprints archives

  • Authors:
  • Emma Tonkin; Henk L. Muller

  • Affiliations:
  • University of Bath, Bath, United Kingdom; University of Bristol, Bristol, United Kingdom

  • Venue:
  • Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
  • Year:
  • 2008

Abstract

In this paper we present a system called paperBase that aids users in entering metadata for preprints. PaperBase extracts metadata from the preprint. Using a Dublin Core-based REST API, third-party repository software populates a web form with this metadata, which the user can then proofread and complete. PaperBase also predicts likely keywords for each preprint, using a Bayesian classifier and the controlled vocabulary of keywords that the archive uses. We tested the system with 12 individuals, measuring the time they took to enter data and the accuracy of the entered metadata. Our system appears to be faster than manual entry, but a larger sample needs to be tested before the difference can be deemed statistically significant. All but two participants perceived it to be faster. Some metadata, in particular the title of a preprint, contains significantly fewer mistakes when entered automatically: even though the automatic system is not perfect, people tend to correct mistakes that paperBase makes but leave their own mistakes in place.
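
The abstract does not say how the Bayesian keyword classifier is built. As an illustration only, the following is a minimal sketch of one common approach: a multinomial naive Bayes model with Laplace smoothing that ranks keywords from a controlled vocabulary. The class name KeywordNaiveBayes, its methods, the tokenizer, and the training texts are all hypothetical and are not taken from paperBase.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep purely alphabetic, whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


class KeywordNaiveBayes:
    """Hypothetical keyword ranker; not the classifier used by paperBase."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)   # keyword -> token counts
        self.doc_counts = Counter()               # keyword -> number of training texts
        self.vocab = set()                        # all tokens seen in training

    def train(self, text, keywords):
        """Record one training text labelled with controlled-vocabulary keywords."""
        tokens = tokenize(text)
        self.vocab.update(tokens)
        for kw in keywords:
            self.doc_counts[kw] += 1
            self.word_counts[kw].update(tokens)

    def score(self, text, keyword):
        """Return log P(keyword) + sum of log P(token | keyword) over known tokens."""
        total_docs = sum(self.doc_counts.values())
        log_p = math.log(self.doc_counts[keyword] / total_docs)
        counts = self.word_counts[keyword]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        for token in tokenize(text):
            if token in self.vocab:               # ignore tokens never seen in training
                log_p += math.log((counts[token] + self.alpha) / denom)
        return log_p

    def predict(self, text, top_n=3):
        """Return the top_n keywords ranked by descending score."""
        ranked = sorted(self.doc_counts,
                        key=lambda kw: self.score(text, kw),
                        reverse=True)
        return ranked[:top_n]


# Toy usage with made-up training texts.
clf = KeywordNaiveBayes()
clf.train("parallel compiler optimisation for multicore processors", ["compilers"])
clf.train("metadata extraction for digital library repositories", ["digital libraries"])
print(clf.predict("automatic metadata for a preprint repository", top_n=1))
# → ['digital libraries']
```

In a setting like the one described, such a ranker would only suggest keywords. The user would still confirm or correct them in the web form.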