Pattern Recognition
Meta-algorithmic systems for document classification
Proceedings of the 2006 ACM symposium on Document engineering
Multimedia Tools and Applications
Layout-aware limiarization for readability enhancement of degraded historical documents
Proceedings of the 9th ACM symposium on Document engineering
Pre-processing for raster-image-based document segmentation begins with image thresholding, a binarization process that separates foreground from background. In this paper, we compare an existing approach (Otsu), a modified existing approach (Kittler-Illingworth), and a simple peak-based approach to thresholding on a set of 982 documents for which ground truth (full text) is available. We use the output of an open source OCR engine whose adaptive/dynamic thresholder can be bypassed in favor of any one of the three global thresholds we tested, which allows the three approaches to be compared in the aggregate. We then use an independently generated dictionary as a means of characterizing thresholder efficacy. Such an approach, if successful, will provide a means of selecting an optimal thresholder in the absence of a large set of ground-truthed documents. Our preliminary findings indicate that this approach may provide a reliable means of thresholder comparison and may eventually preclude the need for time-intensive human ground truthing.
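To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows a textbook form of Otsu's global threshold computed from a gray-level histogram, and one plausible way to score OCR output against a dictionary. The abstract does not say how the dictionary score was computed, so the word-level hit rate used here is an assumption, and the function names (`otsu_threshold`, `binarize`, `dictionary_hit_rate`) are hypothetical. The Kittler-Illingworth and peak-based thresholders are not shown.

```python
from typing import List, Sequence, Set


def otsu_threshold(histogram: Sequence[int]) -> int:
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the low (dark) class, the rest the high class.
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram contains no pixels")
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_var = 0, -1.0
    weight_low = 0  # number of pixels in the low class so far
    sum_low = 0     # sum of gray levels in the low class so far
    for t, count in enumerate(histogram):
        weight_low += count
        sum_low += t * count
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue  # one class is empty, so the variance is undefined
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map gray levels to 1 (dark foreground) or 0 (light background)."""
    return [1 if p <= threshold else 0 for p in pixels]


def dictionary_hit_rate(ocr_text: str, dictionary: Set[str]) -> float:
    """Fraction of OCR'd words found in the dictionary (assumed metric)."""
    words = [w.strip(".,;:!?\"'()").lower() for w in ocr_text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


if __name__ == "__main__":
    # Toy bimodal histogram: dark ink near level 10, light paper near 200.
    hist = [0] * 256
    hist[10], hist[200] = 50, 50
    t = otsu_threshold(hist)
    print("threshold:", t)
    print("binarized:", binarize([10, 200], t))
    print("hit rate:", dictionary_hit_rate("The quick brwn fox.", {"the", "quick", "fox"}))
```

Under this assumed metric, a thresholder whose binarized images lead the OCR engine to emit a larger share of dictionary words would rank higher, which is the kind of comparison that needs no per-document ground truth.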