An optical character recognition approach to qualifying thresholding algorithms

  • Authors:
  • Margaret Sturgill; Steven J. Simske

  • Affiliations:
  • Hewlett Packard Labs, Fort Collins, CO, USA; Hewlett Packard Labs, Fort Collins, CO, USA

  • Venue:
  • Proceedings of the eighth ACM symposium on Document engineering
  • Year:
  • 2008

Abstract

Pre-processing for raster-image-based document segmentation begins with image thresholding, a binarization process that separates foreground from background. In this paper, we compare an existing (Otsu), a modified existing (Kittler-Illingworth), and a simple peak-based thresholding approach on a set of 982 documents for which ground truth (full text) is available. We use the output of an open source OCR engine whose built-in adaptive/dynamic thresholder can be bypassed in favor of any one of the three global thresholders we tested. This allows the three approaches to be compared in the aggregate. We then use an independently generated dictionary as a means of characterizing thresholder efficacy. Such an approach, if successful, will provide the means for selecting an optimal thresholder in the absence of a large set of ground-truthed documents. Our preliminary findings indicate that this approach may provide a reliable means for thresholder comparison and may eventually preclude the need for time-intensive human ground truthing.
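
To make the two ideas in the abstract concrete, the sketch below shows a textbook Otsu global threshold computed from a grayscale histogram, together with one possible dictionary-based score for OCR output. It is illustrative only: the abstract does not say how the dictionary was applied, so the token hit rate used here is an assumption, and the function names (`otsu_threshold`, `dictionary_hit_rate`, `mean_hit_rate`) are hypothetical. The modified Kittler-Illingworth and peak-based thresholders are not reproduced.

```python
import re


def otsu_threshold(histogram):
    """Textbook Otsu: return the gray level t that maximizes between-class
    variance, where pixels <= t form one class and pixels > t the other."""
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    count_low = 0
    sum_low = 0
    for level, count in enumerate(histogram):
        count_low += count
        sum_low += level * count
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, so the variance is undefined
        mean_low = sum_low / count_low
        mean_high = (weighted_total - sum_low) / count_high
        # Proportional to the between-class variance (constant factor dropped).
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def dictionary_hit_rate(ocr_text, dictionary):
    """Assumed scoring rule: fraction of alphabetic tokens in the OCR output
    that appear in the dictionary (a set of lowercase words)."""
    words = re.findall(r"[a-z]+", ocr_text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in dictionary)
    return hits / len(words)


def mean_hit_rate(ocr_texts, dictionary):
    """Average the per-document hit rate over a collection of OCR outputs."""
    if not ocr_texts:
        return 0.0
    return sum(dictionary_hit_rate(t, dictionary) for t in ocr_texts) / len(ocr_texts)
```

Under these assumptions, each candidate thresholder would be run ahead of the OCR engine on the same document set, and the thresholder with the highest `mean_hit_rate` would be preferred, with no ground-truth text required.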