One of the first steps in a document digitization pipeline is the binarization of the document image. The major subsequent steps, such as layout analysis, line extraction, and text recognition, assume a black-and-white image as input. Several thresholding methods have been proposed for document images, but few of them take the behaviour of the text recognizer into account, and they often rely on parameters that depend on the class of documents. In a large-scale process, neither relying on empirical assumptions nor manual tuning is conceivable. In this paper, we introduce a statistical model of a suitable binarization for a character recognizer. The model is a mixture of Gaussians that gives the prior probability that a binarization will lead to the best transcription afterwards. Training is done at the character level and tuned specifically for the recognizer. The optimization consists of finding the binarization that produces the best character shapes according to the model. As opposed to existing methods, the optimization is goal-directed and not tied to subjective visual criteria. On the one hand, our method uses high-level character shape information to improve preprocessing, resulting in a language-independent system. On the other hand, it can be trained in an unsupervised way, significantly reducing the need for human intervention. We demonstrate the effectiveness of this approach, called Gaussian Mixture Token Thresholding, on a subset of the Google 1000 Books dataset containing old documents, where we achieve an improvement of more than 10 points over a regular binarization.
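The core selection step described above — scoring candidate binarizations under a Gaussian mixture over character-shape features and keeping the one with the highest likelihood — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature vector (`shape_features`), the single-component mixture, and the candidate threshold grid are all hypothetical stand-ins for the learned, recognizer-specific model described in the paper.

```python
import numpy as np

def mixture_loglik(x, weights, means, covs):
    """Log-likelihood of feature vector x under a diagonal-covariance
    Gaussian mixture: log sum_k w_k * N(x; mu_k, Sigma_k)."""
    logps = []
    for w, m, c in zip(weights, means, covs):
        d = x - m
        logps.append(np.log(w)
                     - 0.5 * np.sum(np.log(2 * np.pi * c))
                     - 0.5 * np.sum(d * d / c))
    return np.logaddexp.reduce(logps)

def shape_features(binary):
    """Hypothetical shape descriptor: global ink ratio and the standard
    deviation of the vertical projection profile."""
    ink_ratio = binary.mean()
    profile = binary.mean(axis=0)
    return np.array([ink_ratio, profile.std()])

def select_threshold(gray, weights, means, covs, candidates):
    """Pick the threshold whose binarization yields character shapes
    most probable under the mixture model (goal-directed selection)."""
    best_t, best_score = None, -np.inf
    for t in candidates:
        binary = (gray < t).astype(float)  # dark pixels count as ink
        score = mixture_loglik(shape_features(binary), weights, means, covs)
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

In the paper's setting the mixture would be trained (unsupervised) on character images produced by the recognizer's preferred binarizations; here a one-component mixture with hand-set parameters stands in for that trained model.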