Language independent thresholding optimization using a Gaussian mixture modelling of the character shapes

  • Authors:
  • Yves Rangoni; Joost van Beusekom; Thomas M. Breuel

  • Affiliations:
  • German Research Center for Artificial Intelligence (DFKI) GmbH; Technical University of Kaiserslautern, Kaiserslautern, Germany; German Research Center for Artificial Intelligence (DFKI) GmbH and Technical University of Kaiserslautern, Kaiserslautern, Germany

  • Venue:
  • Proceedings of the International Workshop on Multilingual OCR

  • Year:
  • 2009

Abstract

One of the first steps in a digitization process is the binarization of the document image. Major subsequent steps such as layout analysis, line extraction, and text recognition assume a black-and-white image as input. Several thresholding methods have been proposed to handle this problem for document images, but few of them take the behaviour of the text recognizer into account, and they often rely on parameters that depend on the class of documents. In a large-scale process, neither relying on empirical assumptions nor manual tuning is feasible. In this paper, we introduce a statistical model of the binarization best suited to a character recognizer. The model is a mixture of Gaussians that gives the prior probability that a binarization will lead to the best transcription afterwards. Training is done at the character level and is tuned specifically to the recognizer. The optimization consists of finding the binarization that produces the best character shapes according to the model. Unlike existing methods, the optimization is goal-directed and not tied to subjective visual criteria. On the one hand, our method uses high-level character-shape information to improve preprocessing, resulting in a language-independent system. On the other hand, it can be trained in an unsupervised way, significantly reducing the need for human intervention. We demonstrate the effectiveness of this approach, called Gaussian Mixture Token Thresholding, on a subset of the Google 1000 Books dataset containing old documents, where we achieve an improvement of more than 10 points compared to a regular binarization.
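For intuition only, the following is a minimal sketch of the scoring loop the abstract describes: fit a Gaussian mixture over character-shape features extracted from binarized training pages, then pick the threshold whose binarization yields shapes with the highest likelihood under that model. The connected-component shape features, the threshold grid, and the use of scikit-learn's GaussianMixture are illustrative assumptions, not the paper's actual Gaussian Mixture Token Thresholding implementation.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumptions: dark text on a light background, connected components
# approximate character shapes, and a GMM over simple shape features
# stands in for the paper's character-shape model.
import numpy as np
from scipy import ndimage
from sklearn.mixture import GaussianMixture


def shape_features(binary_image):
    """Per-component shape features (a hypothetical choice):
    aspect ratio, fill ratio, and area relative to the page."""
    labels, _ = ndimage.label(binary_image)
    feats = []
    for i, obj in enumerate(ndimage.find_objects(labels)):
        comp = labels[obj] == i + 1          # mask of this component only
        h, w = comp.shape
        area = comp.sum()
        feats.append([w / h, area / (h * w), area / binary_image.size])
    return np.asarray(feats).reshape(-1, 3)  # (0, 3) when no components


def train_shape_model(binarized_training_images, n_components=8):
    """Unsupervised fit of a Gaussian mixture over character-shape
    features, mirroring the character-level training described."""
    X = np.vstack([shape_features(img) for img in binarized_training_images])
    return GaussianMixture(n_components=n_components, random_state=0).fit(X)


def best_threshold(gray_image, gmm, thresholds=np.linspace(0.2, 0.8, 25)):
    """Goal-directed threshold selection: keep the threshold whose
    binarization gives the highest mean log-likelihood under the model."""
    best_t, best_score = None, -np.inf
    for t in thresholds:
        binary = gray_image < t * gray_image.max()  # dark text -> foreground
        feats = shape_features(binary)
        if len(feats) == 0:
            continue
        score = gmm.score(feats)                    # mean log-likelihood
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

In this sketch the "goal-directed" aspect is reduced to maximizing shape likelihood over a fixed threshold grid; the paper instead trains the model specifically for its recognizer and optimizes the binarization against that model.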