Recognition and classification of figures in PDF documents

Authors:
Mingyan Shao; Robert P. Futrelle
Affiliations:
Northeastern University, Boston, MA (both authors)
Venue:
GREC'05: Proceedings of the 6th International Conference on Graphics Recognition: Ten Years Review and Future Perspectives
Year:
2005

Abstract

Graphics recognition for raster-based input discovers primitives such as lines, arrowheads, and circles. This paper instead focuses on graphics recognition of figures in vector-based PDF documents. The first stage extracts the graphic and text primitives corresponding to figures. An interpreter was constructed to translate PDF content into a set of self-contained Java graphics and text objects, freed from the intricacies of the PDF file. The second stage discovers simple graphics entities which we call graphemes, e.g., a pair of primitive graphic objects satisfying certain geometric constraints. The third stage uses machine learning to classify figures using grapheme statistics as attributes. A boosting-based learner (LogitBoost in the Weka toolkit) achieved 100% classification accuracy in hold-out-one (leave-one-out) training/testing using 16 grapheme types extracted from 36 figures from BioMed Central journal research papers. The approach can readily be adapted to raster graphics recognition.
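
The second and third stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which was written in Java and used LogitBoost in Weka. The sketch is in Python, the two grapheme types ("corner" and "parallel_pair"), the tolerance, and the class labels are hypothetical, and a simple 1-nearest-neighbour classifier stands in for the boosting-based learner. It assumes the primitives have already been extracted as non-degenerate line segments.

```python
import math
from collections import Counter
from itertools import combinations

# A segment is ((x1, y1), (x2, y2)); assumed non-degenerate (non-zero length).

def _direction(seg):
    (x1, y1), (x2, y2) = seg
    return (x2 - x1, y2 - y1)

def _share_endpoint(a, b, tol=1e-6):
    """True if any endpoint of a coincides with any endpoint of b."""
    return any(math.dist(p, q) <= tol for p in a for q in b)

def _is_perpendicular(a, b, tol=1e-6):
    (ax, ay), (bx, by) = _direction(a), _direction(b)
    scale = math.hypot(ax, ay) * math.hypot(bx, by)
    return abs(ax * bx + ay * by) <= tol * scale

def _is_parallel(a, b, tol=1e-6):
    (ax, ay), (bx, by) = _direction(a), _direction(b)
    scale = math.hypot(ax, ay) * math.hypot(bx, by)
    return abs(ax * by - ay * bx) <= tol * scale

def grapheme_counts(segments):
    """Count hypothetical grapheme types over all pairs of segments.

    'corner': two segments that share an endpoint and are perpendicular.
    'parallel_pair': two parallel segments that do not form a corner.
    """
    counts = Counter()
    for a, b in combinations(segments, 2):
        if _share_endpoint(a, b) and _is_perpendicular(a, b):
            counts["corner"] += 1
        elif _is_parallel(a, b):
            counts["parallel_pair"] += 1
    return counts

def to_vector(counts, grapheme_types):
    """Turn grapheme counts into a fixed-order attribute vector."""
    return [counts.get(t, 0) for t in grapheme_types]

def leave_one_out_accuracy(vectors, labels):
    """Hold out each figure in turn and classify it from the others.

    Uses 1-nearest-neighbour as a stand-in classifier (an assumption;
    the paper reports LogitBoost).
    """
    correct = 0
    for i, (v, y) in enumerate(zip(vectors, labels)):
        rest = [
            (math.dist(v, u), lab)
            for j, (u, lab) in enumerate(zip(vectors, labels))
            if j != i
        ]
        predicted = min(rest)[1]
        correct += predicted == y
    return correct / len(vectors)

# Example: a 2 x 1 rectangle drawn as four line segments.
rectangle = [
    ((0, 0), (2, 0)),
    ((2, 0), (2, 1)),
    ((2, 1), (0, 1)),
    ((0, 1), (0, 0)),
]
rect_counts = grapheme_counts(rectangle)
rect_vector = to_vector(rect_counts, ["corner", "parallel_pair"])
```

Each figure is reduced to a vector of grapheme counts, so the classifier never sees the raw geometry. Any learner that accepts numeric attributes could replace the nearest-neighbour stand-in.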