Efficient OCR post-processing combining language, hypothesis and error models

  • Authors:
  • Rafael Llobet; J. Ramon Navarro-Cerdan; Juan-Carlos Perez-Cortes; Joaquim Arlandis

  • Affiliations:
  • Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain; Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain; Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain; Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain

  • Venue:
  • SSPR&SPR'10 Proceedings of the 2010 joint IAPR international conference on Structural, syntactic, and statistical pattern recognition
  • Year:
  • 2010

Abstract

In this paper, an OCR post-processing method that combines a language model, OCR hypothesis information and an error model is proposed. The approach can be seen as a flexible and efficient way to perform Stochastic Error-Correcting Language Modeling. We use Weighted Finite-State Transducers (WFSTs) to represent the language model, the complete set of OCR hypotheses interpreted as a sequence of vectors of a posteriori class probabilities, and an error model with symbol substitutions, insertions and deletions. This approach combines the practical advantages of a decoupled (OCR + post-processor) model with the error-recovery power of an integrated model.
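
To make the combination of the three models concrete, the following is a minimal sketch and not the authors' implementation. It assumes a unigram word lexicon as the language model and replaces WFST composition with a plain dynamic program over negative log-probabilities, which finds the same kind of lowest-cost path for this simplified case. The names `correct`, `alignment_cost` and `emit_cost`, the penalty constants and the toy data are all hypothetical.

```python
import math

# Hypothetical error-model penalties (negative log-probabilities).
SUB_COST = 3.0  # OCR read one symbol in place of another
INS_COST = 4.0  # OCR produced a spurious symbol
DEL_COST = 4.0  # OCR missed a symbol


def neg_log(p):
    """Negative log-probability; zero probability maps to infinite cost."""
    return -math.log(p) if p > 0 else math.inf


def emit_cost(posteriors, target):
    """Cheapest way to read one posterior vector as the symbol `target`."""
    return min(
        neg_log(p) + (0.0 if sym == target else SUB_COST)
        for sym, p in posteriors.items()
    )


def alignment_cost(hypothesis, word):
    """Lowest cost of aligning the OCR hypothesis with `word`.

    `hypothesis` is a list of dicts mapping symbols to a posteriori
    class probabilities, one dict per recognized position.
    """
    T, n = len(hypothesis), len(word)
    cost = [[math.inf] * (n + 1) for _ in range(T + 1)]
    cost[0][0] = 0.0
    for i in range(T + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:  # match or substitution
                step = emit_cost(hypothesis[i - 1], word[j - 1])
                cost[i][j] = min(cost[i][j], cost[i - 1][j - 1] + step)
            if i > 0:  # spurious OCR symbol, dropped by the corrector
                skip = min(neg_log(p) for p in hypothesis[i - 1].values())
                cost[i][j] = min(cost[i][j], cost[i - 1][j] + INS_COST + skip)
            if j > 0:  # symbol missed by the OCR, restored by the corrector
                cost[i][j] = min(cost[i][j], cost[i][j - 1] + DEL_COST)
    return cost[T][n]


def correct(hypothesis, lexicon):
    """Return the lexicon word with the lowest combined cost."""
    return min(
        lexicon,
        key=lambda w: neg_log(lexicon[w]) + alignment_cost(hypothesis, w),
    )


# Toy data: the top-1 OCR reading is "cot", but "cat" is far more likely a priori.
hypothesis = [
    {"c": 0.9, "e": 0.1},
    {"o": 0.6, "a": 0.4},
    {"t": 0.8, "l": 0.2},
]
lexicon = {"cat": 0.6, "cot": 0.01, "coat": 0.39}
print(correct(hypothesis, lexicon))  # prints "cat"
```

Because the second position keeps the alternative "a" with probability 0.4 instead of committing to the top-1 symbol, the language model can override the OCR reading without paying a substitution penalty. A full WFST formulation would additionally allow arbitrary language models, such as n-grams, in place of the flat lexicon assumed here.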