Faithful mathematical formula recognition from PDF documents

  • Authors: Josef B. Baker, Alan P. Sexton, Volker Sorge
  • Affiliations: University of Birmingham (all authors)
  • Venue: DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
  • Year: 2010

Abstract

We present an approach to extracting mathematical formulae directly from PDF documents. We exploit both the perfect character information and the additional font and spacing information available in a PDF document to ensure faithful recognition of mathematical expressions. The extracted information can be post-processed to produce suitable markup that can be re-inserted into the PDF document, enabling accessibility technology to handle mathematical formulae. Furthermore, we demonstrate how we recognise different types of mathematical objects, such as relations and operators, without reference to predefined knowledge or dictionary lookup, using only character clustering together with inter-character spacing and font information. All of this contributes to our goal of reconstructing the intended semantics of a formula from its presentation.
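
To make the idea of spacing-based character clustering concrete, the following is a minimal illustrative sketch in Python. It is not the authors' algorithm: the `Glyph` record, the `cluster_by_gap` function, the `gap_ratio` threshold, the font names, and the sample coordinates are all assumptions introduced here for illustration. It only shows how glyphs on one line might be grouped when the horizontal gap between neighbours is small relative to the font size, so that a multi-letter operator name forms one cluster while variables and relations stay separate.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one character extracted from a PDF line."""
    char: str
    x0: float    # left edge of the glyph
    x1: float    # right edge of the glyph
    font: str    # font name (illustrative values only)
    size: float  # font size in points


def cluster_by_gap(glyphs, gap_ratio=0.15):
    """Group glyphs left to right; start a new cluster whenever the gap
    to the previous glyph exceeds gap_ratio times the larger font size.
    The threshold value is an assumption, not taken from the paper."""
    clusters = []
    for g in sorted(glyphs, key=lambda g: g.x0):
        if clusters:
            prev = clusters[-1][-1]
            gap = g.x0 - prev.x1
            if gap <= gap_ratio * max(prev.size, g.size):
                clusters[-1].append(g)
                continue
        clusters.append([g])
    return clusters


# Made-up coordinates for the expression "sin x = y".
line = [
    Glyph("s", 0.0, 4.0, "CMR10", 10.0),
    Glyph("i", 4.2, 6.5, "CMR10", 10.0),
    Glyph("n", 6.7, 11.5, "CMR10", 10.0),
    Glyph("x", 14.0, 19.0, "CMMI10", 10.0),
    Glyph("=", 22.5, 30.0, "CMR10", 10.0),
    Glyph("y", 33.5, 38.5, "CMMI10", 10.0),
]

groups = ["".join(g.char for g in c) for c in cluster_by_gap(line)]
print(groups)
```

With these sample values the tightly spaced upright letters fall into a single cluster and the remaining symbols stay separate. A fuller treatment would also compare the font of each cluster and the width of the surrounding gaps, which is the kind of evidence the abstract says is used to tell relations from operators.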