Improving Book OCR by Adaptive Language and Image Models

Authors:
Dar-Shyang Lee;Ray Smith
Venue:
DAS '12 Proceedings of the 2012 10th IAPR International Workshop on Document Analysis Systems
Year:
2012

Abstract

In order to cope with the vast diversity of book content and typefaces, it is important for OCR systems to leverage the strong consistency within a book but adapt to variations across books. We describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to shapes and vocabularies within a book to identify inconsistencies as correction hypotheses, but relies on the other for effective cross-validation. Using the open source Tesseract engine as baseline, results on a large data set of scanned books demonstrate that word error rates can be reduced by 25 percent using this approach.
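
The cross-validation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and it does not use Tesseract. Everything here is an assumption made for illustration: the function names, the `min_count` threshold, the restriction to single-character substitutions, and above all the `confusable_pairs` set, which is a hypothetical stand-in for a document-specific image model (in a real system, shape evidence would come from glyph images clustered within the book). The sketch shows only one direction of the scheme: a language-side hypothesis is accepted only when the image side agrees.

```python
from collections import Counter


def build_book_vocabulary(words, min_count=2):
    """Document-specific vocabulary: words seen at least min_count times in this book."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= min_count}


def differs_by_one_substitution(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def language_hypotheses(word, vocabulary):
    """Language path: propose in-vocabulary words one substitution away from an unknown word."""
    if word in vocabulary:
        return []
    return sorted(v for v in vocabulary if differs_by_one_substitution(word, v))


def image_supports(word, candidate, confusable_pairs):
    """Hypothetical image check: every substituted character pair must be
    listed as visually confusable in this book's typeface."""
    for x, y in zip(word, candidate):
        if x != y and frozenset((x, y)) not in confusable_pairs:
            return False
    return True


def correct(words, confusable_pairs, min_count=2):
    """Accept a language hypothesis only if it is unique and the image check agrees."""
    vocabulary = build_book_vocabulary(words, min_count)
    corrected = []
    for w in words:
        accepted = [
            c for c in language_hypotheses(w, vocabulary)
            if image_supports(w, c, confusable_pairs)
        ]
        corrected.append(accepted[0] if len(accepted) == 1 else w)
    return corrected


if __name__ == "__main__":
    ocr_words = ["the", "the", "tbe", "cat", "cat", "bat"]
    # Assumed: in this typeface, 'h' and 'b' are easily confused.
    pairs = {frozenset("hb")}
    print(correct(ocr_words, pairs))
```

In this example "tbe" is rewritten to "the" because both paths agree, while "bat" is left alone: the language path proposes "cat", but the assumed image evidence does not support a b/c confusion. That is the sense in which each model relies on the other for validation.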