Language independent thresholding optimization using a Gaussian mixture modelling of the character shapes

  • Authors:
  • Yves Rangoni; Joost van Beusekom; Thomas M. Breuel

  • Affiliations:
  • German Research Center for Artificial Intelligence (DFKI) GmbH; Technical University of Kaiserslautern, Kaiserslautern, Germany; German Research Center for Artificial Intelligence (DFKI) GmbH and Technical University of Kaiserslautern, Kaiserslautern, Germany

  • Venue:
  • Proceedings of the International Workshop on Multilingual OCR

  • Year:
  • 2009

Abstract

One of the first steps in a digitization process is the binarization of the document image. Major subsequent steps such as layout analysis, line extraction, and text recognition assume a black-and-white image as input. Several thresholding methods have been proposed to handle this problem for document images, but few of them take the behaviour of the text recognizer into account, and they often rely on parameters that depend on the class of documents. In a large-scale process, neither relying on empirical assumptions nor manual tuning is feasible. In this paper, we introduce a statistical model of the binarization best suited to a character recognizer. The model is a mixture of Gaussians that gives the prior probability that a binarization will lead to the best transcription afterwards. Training is done at the character level and is tuned specifically to the recognizer. The optimization consists of finding the binarization that produces the best character shapes according to the model. Unlike existing methods, the optimization is goal-directed and not tied to subjective visual criteria. On the one hand, our method uses high-level character-shape information to improve preprocessing, resulting in a language-independent system. On the other hand, it can be trained in an unsupervised way, significantly reducing the need for human intervention. We demonstrate the effectiveness of this approach, called Gaussian Mixture Token Thresholding, on a subset of the Google 1000 Books dataset containing old documents, where we achieve an improvement of more than 10 points compared to a regular binarization.
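For intuition only, the following is a minimal sketch of the scoring loop the abstract describes: fit a Gaussian mixture over character-shape features extracted from binarized training pages, then pick the threshold whose binarization yields shapes with the highest likelihood under that model. The connected-component shape features, the threshold grid, and the use of scikit-learn's GaussianMixture are illustrative assumptions, not the paper's actual Gaussian Mixture Token Thresholding implementation.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumptions: dark text on a light background, connected components
# approximate character shapes, and a GMM over simple shape features
# stands in for the paper's character-shape model.
import numpy as np
from scipy import ndimage
from sklearn.mixture import GaussianMixture


def shape_features(binary_image):
    """Per-component shape features (a hypothetical choice):
    aspect ratio, fill ratio, and area relative to the page."""
    labels, _ = ndimage.label(binary_image)
    feats = []
    for i, obj in enumerate(ndimage.find_objects(labels)):
        comp = labels[obj] == i + 1          # mask of this component only
        h, w = comp.shape
        area = comp.sum()
        feats.append([w / h, area / (h * w), area / binary_image.size])
    return np.asarray(feats).reshape(-1, 3)  # (0, 3) when no components


def train_shape_model(binarized_training_images, n_components=8):
    """Unsupervised fit of a Gaussian mixture over character-shape
    features, mirroring the character-level training described."""
    X = np.vstack([shape_features(img) for img in binarized_training_images])
    return GaussianMixture(n_components=n_components, random_state=0).fit(X)


def best_threshold(gray_image, gmm, thresholds=np.linspace(0.2, 0.8, 25)):
    """Goal-directed threshold selection: keep the threshold whose
    binarization gives the highest mean log-likelihood under the model."""
    best_t, best_score = None, -np.inf
    for t in thresholds:
        binary = gray_image < t * gray_image.max()  # dark text -> foreground
        feats = shape_features(binary)
        if len(feats) == 0:
            continue
        score = gmm.score(feats)                    # mean log-likelihood
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

In this sketch the "goal-directed" aspect is reduced to maximizing shape likelihood over a fixed threshold grid; the paper instead trains the model specifically for its recognizer and optimizes the binarization against that model.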