SciPlore Xtract: extracting titles from scientific PDF documents by analyzing style information

Authors:
Jöran Beel;Bela Gipp;Ammar Shaker;Nick Friedrich
Affiliations:
Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany and UC Berkeley, Berkeley, California;Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany and UC Berkeley, Berkeley, California;Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany;Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany
Venue:
ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
Year:
2010

Citing 3

Automatic document metadata extraction using support vector machines
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

Information extraction from research papers using conditional random fields
Information Processing and Management: an International Journal

Automatic extraction of titles from general documents using machine learning
Information Processing and Management: an International Journal

Cited 3

A practical method for compatibility evaluation of portable document formats
ACIIDS'13 Proceedings of the 5th Asian conference on Intelligent Information and Database Systems - Volume Part II

Evaluation of header metadata extraction approaches and tools for scientific PDF documents
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

Docear's PDF inspector: title extraction from PDF files
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries

Abstract

Extracting titles from a PDF's full text is an important task in information retrieval, since the title is needed to identify the document. Existing approaches apply complicated and computationally expensive machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic that considers style information (font size) to identify a PDF's title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an 'academic search engine' scenario, as well as better run times (8:19 minutes vs. 57:26 minutes).
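
The abstract states only that the heuristic uses font size, so the following is a minimal sketch of one way such a rule could look, not the authors' implementation. It assumes a PDF parser has already produced text lines with font sizes; the `TextLine` structure, the `extract_title` function, the `tolerance` parameter, and the sample data are all hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """One text line as a PDF parser might report it (hypothetical structure)."""
    text: str
    font_size: float
    page: int  # 1-based page number


def extract_title(lines: List[TextLine], tolerance: float = 0.1) -> str:
    """Guess the title as the first run of largest-font lines on page 1.

    Assumption: the title is set in the largest font on the first page and
    may wrap over several consecutive lines of the same size.
    """
    first_page = [ln for ln in lines if ln.page == 1 and ln.text.strip()]
    if not first_page:
        return ""
    largest = max(ln.font_size for ln in first_page)
    title_parts: List[str] = []
    for ln in first_page:
        if abs(ln.font_size - largest) <= tolerance:
            title_parts.append(ln.text.strip())
        elif title_parts:
            break  # the run of largest-font lines has ended
    return " ".join(title_parts)


# Made-up sample input, in reading order, for illustration only.
sample = [
    TextLine("Conference header", 8.0, 1),
    TextLine("SciPlore Xtract: Extracting Titles from Scientific", 18.0, 1),
    TextLine("PDF Documents by Analyzing Style Information", 18.0, 1),
    TextLine("Author names", 11.0, 1),
    TextLine("Abstract text", 9.0, 1),
]
print(extract_title(sample))
```

A rule like this needs only a single pass over the first page, which is consistent with the short run times the abstract reports for the heuristic, although the actual rules in SciPlore Xtract may differ.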