A survey of document image classification: problem statement, classifier architecture and performance evaluation

Authors:
Nawei Chen; Dorothea Blostein
Affiliations:
Queen’s University, School of Computing, K7L 3N6, Kingston, ON, Canada; Queen’s University, School of Computing, K7L 3N6, Kingston, ON, Canada
Venue:
International Journal on Document Analysis and Recognition
Year:
2007

Cited 9

A novel efficient classification algorithm for search engines
AIC'08 Proceedings of the 8th conference on Applied informatics and communications

The Diagonal Split: A Pre-segmentation Step for Page Layout Analysis and Classification
IbPRIA '09 Proceedings of the 4th Iberian Conference on Pattern Recognition and Image Analysis

Hierarchical Ensemble Support Cluster Machine
MCS '09 Proceedings of the 8th International Workshop on Multiple Classifier Systems

Picture extraction from digitized historical manuscripts
Proceedings of the ACM International Conference on Image and Video Retrieval

Surfing on artistic documents with visually assisted tagging
Proceedings of the international conference on Multimedia

Automatic segmentation of digitalized historical manuscripts
Multimedia Tools and Applications

Rule based document understanding of historical books using a hybrid fuzzy classification system
Proceedings of the 2011 Workshop on Historical Document Imaging and Processing

I know what you are reading: recognition of document types using mobile eye tracking
Proceedings of the 2013 International Symposium on Wearable Computers

Near-duplicate document image matching: A graphical perspective
Pattern Recognition

Abstract

Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.