Extraction, layout analysis and classification of diagrams in PDF documents

Authors:
Robert P. Futrelle;Mingyan Shao;Chris Cieslik;Andrea Elaina Grimes
Affiliations:
-;-;-;-
Venue:
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Year:
2003

Abstract

Diagrams are a critical part of virtually all scientific and technical documents. Analyzing diagrams will be important for building comprehensive document retrieval systems. This paper focuses on the extraction and classification of diagrams from PDF documents. We study diagrams available in vector (not raster) format in online research papers.

PDF files are parsed and their vector graphics components installed in a spatial index. Subdiagrams are found by analyzing white space gaps. A set of statistics is generated for each diagram, e.g., the number of horizontal lines and vertical lines. The statistics form a feature vector description of the diagram. The vectors are used in a kernel-based machine learning system (Support Vector Machine). Separating a set of bar graphs from non-bar-graphs gathered from 20,000 biology research papers gave a classification accuracy of 91.7%. The approach is directly applicable to diagrams vectorized from images.
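
The abstract gives no implementation details, so the following is only a minimal sketch of two of the steps it names: splitting a page's vector graphics into subdiagrams at white space gaps, and counting horizontal and vertical lines to form a feature vector. Everything here is an assumption made for illustration: the `Segment` tuple, the function names `split_on_gaps` and `line_statistics`, the tolerance and gap thresholds, and the simplification to gaps along the x-axis only. It does not use a spatial index, and it is not the authors' code.

```python
from typing import List, Tuple

# Hypothetical representation: a line segment as (x1, y1, x2, y2).
Segment = Tuple[float, float, float, float]


def split_on_gaps(segments: List[Segment], min_gap: float = 20.0) -> List[List[Segment]]:
    """Group segments into subdiagrams separated by empty x-ranges wider than min_gap.

    Simplification: only gaps along the x-axis are considered.
    """
    ordered = sorted(segments, key=lambda s: min(s[0], s[2]))
    groups: List[List[Segment]] = []
    current: List[Segment] = []
    right_edge = 0.0
    for seg in ordered:
        left, right = min(seg[0], seg[2]), max(seg[0], seg[2])
        # A wide enough gap between the running right edge and this
        # segment's left edge closes the current group.
        if current and left - right_edge > min_gap:
            groups.append(current)
            current = []
        right_edge = right if not current else max(right_edge, right)
        current.append(seg)
    if current:
        groups.append(current)
    return groups


def line_statistics(segments: List[Segment], tol: float = 0.5) -> List[int]:
    """Return [horizontal count, vertical count, other count] for one diagram."""
    horizontal = vertical = other = 0
    for x1, y1, x2, y2 in segments:
        if abs(y1 - y2) <= tol and abs(x1 - x2) > tol:
            horizontal += 1
        elif abs(x1 - x2) <= tol and abs(y1 - y2) > tol:
            vertical += 1
        else:
            other += 1
    return [horizontal, vertical, other]


# Made-up page content: a bar-graph-like drawing on the left and
# an unrelated drawing on the right, separated by white space.
page = [
    (0, 0, 100, 0), (0, 0, 0, 80),                   # axes
    (20, 0, 20, 50), (40, 0, 40, 50), (20, 50, 40, 50),  # one bar
    (200, 0, 300, 0), (200, 0, 200, 80),             # second set of axes
    (220, 10, 280, 70),                              # a diagonal line
]

groups = split_on_gaps(page)
features = [line_statistics(g) for g in groups]
```

Each entry of `features` is one feature vector. In a pipeline like the one the abstract describes, such vectors (with many more statistics) would be labeled and passed to a Support Vector Machine for training, for example scikit-learn's `sklearn.svm.SVC`, which is an assumed choice here and not one named in the source.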