Recognition and classification of figures in PDF documents

  • Authors: Mingyan Shao; Robert P. Futrelle

  • Affiliations: Northeastern University, Boston, MA; Northeastern University, Boston, MA

  • Venue: GREC'05 Proceedings of the 6th international conference on Graphics Recognition: Ten Years Review and Future Perspectives
  • Year: 2005

Abstract

Graphics recognition for raster-based input discovers primitives such as lines, arrowheads, and circles. This paper focuses on graphics recognition of figures in vector-based PDF documents. The first stage consists of extracting the graphic and text primitives corresponding to figures. An interpreter was constructed to translate PDF content into a set of self-contained graphics and text objects (in Java), freed from the intricacies of the PDF file. The second stage consists of discovering simple graphics entities which we call graphemes, e.g., a pair of primitive graphic objects satisfying certain geometric constraints. The third stage uses machine learning to classify figures using grapheme statistics as attributes. A boosting-based learner (LogitBoost in the Weka toolkit) was able to achieve 100% classification accuracy in hold-out-one training/testing using 16 grapheme types extracted from 36 figures from BioMed Central journal research papers. The approach can readily be adapted to raster graphics recognition.
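
The second stage described above, finding pairs of primitives that satisfy geometric constraints and counting them as figure attributes, could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not list the 16 grapheme types, so the `Segment` class, the two grapheme names (`corner`, `parallelPair`), and the tolerance `EPS` are all hypothetical and are not taken from the paper.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GraphemeSketch {

    /** Hypothetical stand-in for a line primitive extracted from a PDF figure. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }

        double dx() { return x2 - x1; }
        double dy() { return y2 - y1; }
        double length() { return Math.hypot(dx(), dy()); }
    }

    // Assumed tolerance for the geometric tests; not a value from the paper.
    static final double EPS = 1e-6;

    static boolean samePoint(double ax, double ay, double bx, double by) {
        return Math.hypot(ax - bx, ay - by) <= EPS;
    }

    /** True if any endpoint of a coincides with any endpoint of b. */
    static boolean shareEndpoint(Segment a, Segment b) {
        return samePoint(a.x1, a.y1, b.x1, b.y1) || samePoint(a.x1, a.y1, b.x2, b.y2)
            || samePoint(a.x2, a.y2, b.x1, b.y1) || samePoint(a.x2, a.y2, b.x2, b.y2);
    }

    /** Directions are parallel when the 2D cross product is (nearly) zero. */
    static boolean parallel(Segment a, Segment b) {
        double cross = a.dx() * b.dy() - a.dy() * b.dx();
        return Math.abs(cross) <= EPS * a.length() * b.length();
    }

    /** Directions are perpendicular when the dot product is (nearly) zero. */
    static boolean perpendicular(Segment a, Segment b) {
        double dot = a.dx() * b.dx() + a.dy() * b.dy();
        return Math.abs(dot) <= EPS * a.length() * b.length();
    }

    /** Counts the two illustrative grapheme types over all unordered segment pairs. */
    static Map<String, Integer> countGraphemes(List<Segment> segments) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("corner", 0);
        counts.put("parallelPair", 0);
        for (int i = 0; i < segments.size(); i++) {
            for (int j = i + 1; j < segments.size(); j++) {
                Segment a = segments.get(i);
                Segment b = segments.get(j);
                // Skip degenerate (zero-length) segments, which have no direction.
                if (a.length() <= EPS || b.length() <= EPS) continue;
                if (perpendicular(a, b) && shareEndpoint(a, b)) {
                    counts.merge("corner", 1, Integer::sum);
                }
                if (parallel(a, b)) {
                    counts.merge("parallelPair", 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A unit square drawn as four separate line primitives.
        List<Segment> square = Arrays.asList(
            new Segment(0, 0, 1, 0), new Segment(1, 0, 1, 1),
            new Segment(1, 1, 0, 1), new Segment(0, 1, 0, 0));
        System.out.println(countGraphemes(square));
    }
}
```

For the unit square in `main`, the sketch finds four corners and two parallel pairs. A vector of such counts per figure is the kind of attribute set that could then be handed to a classifier such as the LogitBoost learner mentioned in the abstract.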