Evaluation of header metadata extraction approaches and tools for scientific PDF documents

Authors:
Mario Lipinski;Kevin Yao;Corinna Breitinger;Joeran Beel;Bela Gipp
Affiliations:
UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA
Venue:
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Year:
2013

Citing 4
Cited 0

SciPlore Xtract: extracting titles from scientific PDF documents by analyzing style information
ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries

Introducing Mr. DLib, a Machine-readable Digital Library
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

A comparison of metadata extraction techniques for crowdsourced bibliographic metadata management
Proceedings of the 27th Annual ACM Symposium on Applied Computing

Docear's PDF inspector: title extraction from PDF files
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

Abstract

This paper evaluates the performance of tools for the extraction of metadata from scientific articles. Accurate metadata extraction is an important task for automating the management of digital libraries. This comparative study is a guide for developers looking to integrate the most suitable and effective metadata extraction tool into their software. We shed light on the strengths and weaknesses of seven tools in common use. In our evaluation using papers from the arXiv collection, GROBID delivered the best results, followed by Mendeley Desktop. SciPlore Xtract, PDFMeat, and SVMHeaderParse also delivered good results depending on the metadata type to be extracted.
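
The abstract does not say how extraction quality was scored, so the following is only a hypothetical sketch of how a per-field comparison of this kind might be set up. It assumes exact matching after light normalization and a simple accuracy measure; the function names (`normalize`, `accuracy_by_field`), the field names, and the sample records are illustrative assumptions, not the authors' method or data.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def accuracy_by_field(gold, extracted):
    """Return, per metadata field, the fraction of documents matched exactly.

    `gold` and `extracted` map a document id to a dict of field name -> string.
    A field missing from the tool output counts as an error.
    (Assumed scoring scheme; the paper may use a different measure.)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for doc_id, gold_fields in gold.items():
        tool_fields = extracted.get(doc_id, {})
        for field, gold_value in gold_fields.items():
            total[field] += 1
            if normalize(tool_fields.get(field, "")) == normalize(gold_value):
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Made-up records for illustration only; not data from the paper.
gold = {
    "doc1": {"title": "A Study of Things", "authors": "A. Author; B. Writer"},
    "doc2": {"title": "Another Paper", "authors": "C. Person"},
}
extracted = {
    "doc1": {"title": "a study  of things", "authors": "A. Author"},
    "doc2": {"title": "Another Paper", "authors": "C. Person"},
}

scores = accuracy_by_field(gold, extracted)
```

Reporting one score per field, rather than a single overall number, reflects the abstract's observation that some tools perform well only for particular metadata types.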