Extraction, layout analysis and classification of diagrams in PDF documents

  • Authors:
  • Robert P. Futrelle;Mingyan Shao;Chris Cieslik;Andrea Elaina Grimes

  • Venue:
  • ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
  • Year:
  • 2003

Abstract

Diagrams are a critical part of virtually all scientific and technical documents. Analyzing diagrams will be important for building comprehensive document retrieval systems. This paper focuses on the extraction and classification of diagrams from PDF documents. We study diagrams available in vector (not raster) format in online research papers.

PDF files are parsed and their vector graphics components installed in a spatial index. Subdiagrams are found by analyzing white space gaps. A set of statistics is generated for each diagram, e.g., the number of horizontal lines and vertical lines. The statistics form a feature vector description of the diagram. The vectors are used in a kernel-based machine learning system (Support Vector Machine). Separating a set of bar graphs from non-bar-graphs gathered from 20,000 biology research papers gave a classification accuracy of 91.7%. The approach is directly applicable to diagrams vectorized from images.
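The gap analysis and per-diagram statistics described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the PDF has already been parsed into straight line segments given as (x1, y1, x2, y2) tuples, replaces the spatial index with a simple sort along the x axis, and looks only for vertical white space gaps. The names `split_on_vertical_gaps`, `line_features`, `min_gap` and `tol` are hypothetical. The resulting feature vectors are the kind of input an SVM classifier would take; the classification step itself is left out.

```python
from typing import List, Tuple

# Assumption: each graphic element is a straight segment (x1, y1, x2, y2).
Segment = Tuple[float, float, float, float]


def split_on_vertical_gaps(segments: List[Segment], min_gap: float) -> List[List[Segment]]:
    """Group segments into sub-diagrams separated by empty x-intervals wider than min_gap."""
    ordered = sorted(segments, key=lambda s: min(s[0], s[2]))
    groups: List[List[Segment]] = []
    right_edge = 0.0  # rightmost x covered by the current group
    for seg in ordered:
        left, right = min(seg[0], seg[2]), max(seg[0], seg[2])
        if not groups or left - right_edge > min_gap:
            # White space wider than min_gap: start a new sub-diagram.
            groups.append([seg])
            right_edge = right
        else:
            groups[-1].append(seg)
            right_edge = max(right_edge, right)
    return groups


def line_features(segments: List[Segment], tol: float = 0.5) -> List[int]:
    """Return [horizontal count, vertical count, other count] for one diagram."""
    horizontal = sum(
        1 for x1, y1, x2, y2 in segments
        if abs(y1 - y2) <= tol and abs(x1 - x2) > tol
    )
    vertical = sum(
        1 for x1, y1, x2, y2 in segments
        if abs(x1 - x2) <= tol and abs(y1 - y2) > tol
    )
    return [horizontal, vertical, len(segments) - horizontal - vertical]


# Made-up example: a small bar-graph-like drawing and, far to its right,
# a second drawing containing a diagonal line.
segments = [
    (0, 0, 10, 0),   # x axis
    (0, 0, 0, 8),    # y axis
    (2, 0, 2, 5),    # bar, left side
    (2, 5, 4, 5),    # bar, top
    (4, 5, 4, 0),    # bar, right side
    (30, 0, 40, 6),  # diagonal in the second drawing
    (30, 0, 40, 0),  # baseline of the second drawing
]

subdiagrams = split_on_vertical_gaps(segments, min_gap=5)
feature_vectors = [line_features(group) for group in subdiagrams]
```

In this toy input the two drawings are separated by a 20-unit gap, so they fall into separate groups, and the first group is dominated by horizontal and vertical lines, as one would expect of a bar graph. A fuller treatment would also search for horizontal gaps and use a richer set of statistics.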