SciPlore Xtract: extracting titles from scientific PDF documents by analyzing style information

  • Authors:
  • Jöran Beel;Bela Gipp;Ammar Shaker;Nick Friedrich

  • Affiliations:
  • Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany and UC Berkeley, Berkeley, California;Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany and UC Berkeley, Berkeley, California;Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany;Otto-von-Guericke University, Computer Science, ITI, VLBA-Lab, Magdeburg, Germany

  • Venue:
  • ECDL'10 Proceedings of the 14th European Conference on Research and Advanced Technology for Digital Libraries
  • Year:
  • 2010

Abstract

Extracting titles from a PDF's full text is an important task in information retrieval, since the title helps to identify the document. Existing approaches apply complicated and computationally expensive machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic that considers style information (font size) to identify a PDF's title. In a first experiment we show that, in an 'academic search engine' scenario, this heuristic delivers better results (77.9% accuracy) than a Support Vector Machine by CiteSeer (69.4% accuracy) and shorter run times (8:19 minutes vs. 57:26 minutes).
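
To illustrate the general idea of a font-size heuristic, here is a minimal Python sketch. It is not the authors' implementation: the abstract only states that font size is considered, so the concrete rules below (take the largest font size on the first page and join the first contiguous run of text at that size) are assumptions. The `Span` type, the `guess_title` function, and the `min_chars` and `tolerance` parameters are hypothetical names introduced for this example, and the input is assumed to be text spans already extracted from the PDF in reading order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """A piece of text with its font size (hypothetical input format)."""
    text: str
    font_size: float
    page: int = 0  # zero-based page index


def guess_title(spans: List[Span], min_chars: int = 3,
                tolerance: float = 0.1) -> Optional[str]:
    """Guess a title as the first run of largest-font text on page 0.

    Assumption: spans arrive in reading order. Very short spans are
    skipped so that stray symbols do not win on font size alone.
    """
    first_page = [s for s in spans
                  if s.page == 0 and len(s.text.strip()) >= min_chars]
    if not first_page:
        return None

    largest = max(s.font_size for s in first_page)

    parts = []
    for s in first_page:
        if abs(s.font_size - largest) <= tolerance:
            parts.append(s.text.strip())
        elif parts:
            # The run of largest-font text has ended; stop here.
            break
    return " ".join(parts)


spans = [
    Span("ECDL 2010", 9.0),
    Span("SciPlore Xtract: Extracting Titles", 18.0),
    Span("from Scientific PDF Documents", 18.0),
    Span("Jöran Beel", 11.0),
    Span("Abstract", 12.0),
    Span("A large heading on a later page", 18.0, page=1),
]
print(guess_title(spans))
```

A rule of this kind needs only one pass over the first page's spans and no trained model, which is consistent with the short run times the abstract reports for the heuristic, although the actual rule set used in the paper may differ.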