OCR binarization and image pre-processing for searching historical documents

Authors:
Maya R. Gupta;Nathaniel P. Jacobson;Eric K. Garcia
Affiliations:
Electrical Engineering, University of Washington, Seattle, Washington 98195, United States;Electrical Engineering, University of Washington, Seattle, Washington 98195, United States;Electrical Engineering, University of Washington, Seattle, Washington 98195, United States
Venue:
Pattern Recognition
Year:
2007

Abstract

We consider document binarization as a pre-processing step for optical character recognition (OCR), with the goal of keyword search in historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented, along with newly developed methods: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK was used as a black-box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average.
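
For readers unfamiliar with the classic Otsu method named above, the following is a minimal sketch of global Otsu thresholding in plain Python. It is not the authors' implementation and does not cover their multiresolutional variant, error diffusion, or despeckling. The function names `otsu_threshold` and `binarize`, the 8-bit gray-level range, and the convention that dark pixels are ink are assumptions made for illustration.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the dark class; pixels > t form the light class.
    Assumes integer gray levels in the range [0, levels).
    """
    # Histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant 1/total^2 dropped).
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(image: List[List[int]], threshold: int) -> List[List[int]]:
    """Map pixels <= threshold to 0 (assumed ink) and the rest to 255 (paper)."""
    return [[0 if p <= threshold else 255 for p in row] for row in image]


if __name__ == "__main__":
    # Hypothetical 2x3 grayscale patch: three dark pixels, three light ones.
    patch = [[10, 12, 200], [11, 205, 210]]
    flat = [p for row in patch for p in row]
    t = otsu_threshold(flat)
    print(t, binarize(patch, t))
```

A single global threshold like this is the simplest member of the Otsu family; locally adaptive or multiresolution variants would compute thresholds per region or per scale instead.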