Semi automated metadata extraction for preprints archives

Authors:
Emma Tonkin;Henk L. Muller
Affiliations:
University of Bath, Bath, United Kingdom;University of Bristol, Bristol, United Kingdom
Venue:
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Year:
2008

Citing 4

CiteSeer: an automatic citation indexing system
Proceedings of the third ACM conference on Digital libraries

An efficient context-free parsing algorithm
Communications of the ACM

Automatic information extraction from semi-structured Web pages by pattern discovery
Decision Support Systems - Web retrieval and mining

Understanding Search Engines: Mathematical Modeling and Text Retrieval (Software, Environments, Tools), Second Edition

Cited 1

MetRe: supporting the metadata revision process
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries

Abstract

In this paper we present paperBase, a system that helps users enter metadata for preprints. PaperBase extracts metadata from the preprint itself. Through a Dublin Core-based REST API, third-party repository software uses the extracted metadata to populate a web form that the user can then proofread and complete. PaperBase also predicts likely keywords for each preprint, using a Bayesian classifier and the controlled vocabulary of keywords that the archive uses. We tested the system with 12 individuals, measuring the time it took them to enter data and the accuracy of the entered metadata. Our system appears to be faster than manual entry, but a larger sample is needed before the difference can be deemed statistically significant. All but two participants perceived it to be faster. Some metadata, in particular preprint titles, contain significantly fewer mistakes when entered automatically: even though the automatic system is not perfect, people tend to correct the mistakes that paperBase makes but leave their own mistakes in place.
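
The abstract says only that keywords are predicted from a controlled vocabulary with a Bayesian classifier; it does not describe the model. As a rough illustration of how such a predictor might work, the sketch below assumes a multinomial naive Bayes model with Laplace smoothing, trained on previously catalogued texts. The class name KeywordSuggester, its methods, and the whitespace tokenizer are hypothetical and are not taken from paperBase.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep purely alphabetic whitespace-separated tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


class KeywordSuggester:
    """Hypothetical naive Bayes keyword ranker (an assumption, not paperBase's model)."""

    def __init__(self):
        self.doc_counts = Counter()               # keyword -> number of training texts
        self.word_counts = defaultdict(Counter)   # keyword -> word -> occurrences
        self.vocab = set()                        # all words seen during training
        self.total_docs = 0

    def train(self, text, keywords):
        """Record one catalogued text and its controlled-vocabulary keywords."""
        words = tokenize(text)
        self.total_docs += 1
        self.vocab.update(words)
        for kw in keywords:
            self.doc_counts[kw] += 1
            self.word_counts[kw].update(words)

    def suggest(self, text, top_n=3):
        """Return up to top_n keywords ranked by log prior plus log likelihood."""
        words = tokenize(text)
        vocab_size = len(self.vocab) or 1  # avoid a zero denominator
        scores = {}
        for kw, n_docs in self.doc_counts.items():
            total = sum(self.word_counts[kw].values())
            # Prior: share of training texts tagged with this keyword.
            # Texts can carry several keywords, so the priors need not sum to 1;
            # they are used only for ranking.
            score = math.log(n_docs / self.total_docs)
            for w in words:
                # Laplace (add-one) smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[kw][w]
                score += math.log((count + 1) / (total + vocab_size))
            scores[kw] = score
        return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In a workflow like the one the abstract describes, the ranked keywords would be offered as suggestions in the web form, leaving the user to accept or correct them.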