Document Style Census for OCR

Authors:
George Nagy; Prateek Sarkar
Venue:
Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
Year:
2004

Abstract

Four methods of converting paper documents to computer-readable form are compared with regard to hypothetical labor cost: keyboarding, omnifont OCR, style-specific OCR, and style-constrained or style-adaptive OCR. The best choice is determined primarily by (1) the reject rates of the various OCR systems at a given error rate, (2) the fraction of the material that must be labeled for training the system, and (3) the cost of partitioning the material according to style. For large corpora, sampling strategies are proposed both for estimating conversion costs and for taking advantage of style homogeneity.
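
The comparison described in the abstract can be illustrated with a small sketch. This is not the authors' cost model: the additive form, the class name Method, the functions labor_cost and cheapest, the unit costs, and every number below are assumptions made up for illustration. The sketch only combines the three factors the abstract names (reject rate, fraction labeled for training, and partitioning cost) into one hypothetical labor figure per method. Keyboarding is modeled as a method whose "reject rate" is 1.0, since every character is keyed by hand.

```python
from dataclasses import dataclass


@dataclass
class Method:
    """Hypothetical description of one conversion method."""
    name: str
    reject_rate: float       # fraction of characters rejected and keyed by hand
    labeled_fraction: float  # fraction of the corpus labeled by hand for training
    partition_cost: float    # one-time labor cost of sorting material by style


def labor_cost(m, n_chars, key_cost=1.0, label_cost=1.0):
    """Assumed additive labor cost: hand correction + labeling + partitioning."""
    correction = m.reject_rate * n_chars * key_cost
    training = m.labeled_fraction * n_chars * label_cost
    return correction + training + m.partition_cost


def cheapest(methods, n_chars):
    """Return the method with the lowest assumed labor cost for this corpus size."""
    return min(methods, key=lambda m: labor_cost(m, n_chars))


# Illustrative, made-up parameters; not taken from the paper.
METHODS = [
    Method("keyboarding", reject_rate=1.0, labeled_fraction=0.0, partition_cost=0.0),
    Method("omnifont OCR", reject_rate=0.10, labeled_fraction=0.0, partition_cost=0.0),
    Method("style-specific OCR", reject_rate=0.02, labeled_fraction=0.01,
           partition_cost=5000.0),
]

if __name__ == "__main__":
    for n_chars in (10_000, 1_000_000):
        best = cheapest(METHODS, n_chars)
        print(n_chars, best.name, labor_cost(best, n_chars))
```

Under these assumed numbers, the fixed partitioning cost dominates for a small corpus, so omnifont OCR comes out cheapest, while for a large corpus the lower reject rate of the style-specific system outweighs its labeling and partitioning overhead. In practice the reject rates would themselves have to be estimated, for example from a sample of the corpus, as the abstract suggests.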