Docear's PDF inspector: title extraction from PDF files

Authors:
Joeran Beel;Stefan Langer;Marcel Genzmehr;Christoph Müller
Affiliations:
Docear, Magdeburg, Germany;Docear, Magdeburg, Germany;Docear, Magdeburg, Germany;Docear, Magdeburg, Germany
Venue:
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Year:
2013

Citing 4
Cited 1

Automatic document metadata extraction using support vector machines
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Automatic extraction of titles from general documents using machine learning
Information Processing and Management: an International Journal

SciPlore Xtract: extracting titles from scientific PDF documents by analyzing style information
ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries

Docear: an academic literature suite for searching, organizing and creating academic literature
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

Evaluation of header metadata extraction approaches and tools for scientific PDF documents
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

Abstract

In this demo paper, we present Docear's PDF Inspector (DPI). DPI extracts titles from academic PDF files by applying a simple heuristic: the largest text on the first page of a PDF is assumed to be the title. This heuristic achieves accuracies of around 70% and outperforms the tools ParsCit and SciPlore Xtract in both run time and accuracy. DPI is written in Java, runs on any major operating system, and is released under the free open source license GPL 2+ at http://www.docear.org.
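
The heuristic can be illustrated with a minimal Java sketch. This is not DPI's actual source code: the class `LargestTextTitleSketch`, the `Fragment` type, and the method `guessTitle` are hypothetical names introduced for illustration. The sketch also assumes that the text fragments of the first page and their font sizes have already been extracted by some PDF library, and that fragments sharing the largest font size belong to one title and appear in reading order.

```java
import java.util.List;

// Illustrative sketch only; all names here are hypothetical, not DPI's API.
final class LargestTextTitleSketch {

    /** A run of text from the first page with its font size in points (assumed input). */
    static final class Fragment {
        final String text;
        final double fontSize;

        Fragment(String text, double fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the text set in the largest font on the first page,
     * or an empty string if the page contains no non-blank text.
     */
    static String guessTitle(List<Fragment> firstPage) {
        // Pass 1: find the largest font size among non-blank fragments.
        double maxSize = 0.0;
        for (Fragment f : firstPage) {
            if (!f.text.trim().isEmpty()) {
                maxSize = Math.max(maxSize, f.fontSize);
            }
        }

        // Pass 2: join all fragments set in that size, in page order.
        // A small tolerance absorbs rounding noise in extracted font sizes.
        StringBuilder title = new StringBuilder();
        for (Fragment f : firstPage) {
            String text = f.text.trim();
            if (!text.isEmpty() && Math.abs(f.fontSize - maxSize) < 0.01) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(text);
            }
        }
        return title.toString();
    }
}
```

For example, given a first page whose two title lines are set in 18 pt and whose author line and headings are set in smaller sizes, `guessTitle` would join the two 18 pt lines and ignore the rest. A page that uses a larger font for something other than the title, such as a journal banner, would defeat this sketch, which is consistent with the heuristic reaching only around 70% accuracy.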