Efficient OCR post-processing combining language, hypothesis and error models

Authors:
Rafael Llobet;J. Ramon Navarro-Cerdan;Juan-Carlos Perez-Cortes;Joaquim Arlandis
Affiliations:
Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain;Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain;Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain;Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Valencia, Spain
Venue:
SSPR&SPR'10 Proceedings of the 2010 joint IAPR international conference on Structural, syntactic, and statistical pattern recognition
Year:
2010

Abstract

This paper proposes an OCR post-processing method that combines a language model, OCR hypothesis information, and an error model. The approach can be seen as a flexible and efficient way to perform Stochastic Error-Correcting Language Modeling. We use Weighted Finite-State Transducers (WFSTs) to represent the language model, the complete set of OCR hypotheses interpreted as a sequence of vectors of a posteriori class probabilities, and an error model with symbol substitutions, insertions, and deletions. This approach combines the practical advantages of a decoupled (OCR + post-processor) model with the error-recovery power of an integrated model.
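
The idea of combining the three models can be illustrated with a small sketch. This is not the authors' WFST implementation. It is a plain dynamic-programming stand-in that assumes a tiny weighted word list as the language model, fixed substitution, insertion, and deletion penalties as the error model, and per-position posterior vectors as the OCR hypotheses. All names, words, probabilities, and costs here are hypothetical and chosen only for illustration. Costs are negative log probabilities, so they add along a path and the lowest total wins.

```python
import math


def neg_log(p):
    """Convert a probability into an additive cost."""
    return -math.log(p) if p > 0 else math.inf


def alignment_cost(word, hypotheses, sub_cost, ins_cost, del_cost):
    """Lowest cost of explaining the OCR hypothesis sequence with `word`.

    hypotheses: list of dicts mapping a symbol to its posterior probability,
    one dict per OCR position (assumed input format).
    """
    n, m = len(hypotheses), len(word)
    # cost[i][j]: best cost using the first i OCR positions and j word letters
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            c = cost[i][j]
            if c == math.inf:
                continue
            if i < n:
                # Insertion: the OCR produced a symbol with no word letter
                best_any = min(neg_log(p) for p in hypotheses[i].values())
                cost[i + 1][j] = min(cost[i + 1][j], c + ins_cost + best_any)
            if j < m:
                # Deletion: a word letter has no OCR position
                cost[i][j + 1] = min(cost[i][j + 1], c + del_cost)
            if i < n and j < m:
                # Match (no penalty) or substitution (penalty) at this position
                step = min(
                    neg_log(p) + (0.0 if sym == word[j] else sub_cost)
                    for sym, p in hypotheses[i].items()
                )
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1], c + step)
    return cost[n][m]


def correct(hypotheses, lexicon, sub_cost=3.0, ins_cost=4.0, del_cost=4.0):
    """Return the (word, total cost) pair with the lowest combined cost."""
    best_word, best_cost = None, math.inf
    for word, prob in lexicon.items():
        total = neg_log(prob) + alignment_cost(
            word, hypotheses, sub_cost, ins_cost, del_cost
        )
        if total < best_cost:
            best_word, best_cost = word, total
    return best_word, best_cost


# Hypothetical OCR output: the top-1 reading is "c1ear"
hypotheses = [
    {"c": 0.9, "e": 0.1},
    {"1": 0.6, "l": 0.4},
    {"e": 0.95, "c": 0.05},
    {"a": 0.9, "o": 0.1},
    {"r": 0.8, "n": 0.2},
]
# Hypothetical language model: a weighted word list
lexicon = {"clear": 0.5, "cedar": 0.3, "car": 0.2}

top1 = "".join(max(h, key=h.get) for h in hypotheses)
word, total = correct(hypotheses, lexicon)
print(top1, "->", word)  # selects "clear"
```

Because the second-ranked hypothesis "l" is kept alongside "1", the word "clear" is reached without paying any error penalty, which a post-processor that sees only the top-1 string could not do. A WFST-based system would instead express each model as a transducer and search the composed machine for the lowest-cost path, which scales to language models far richer than a word list.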