A Statistical, Nonparametric Methodology for Document Degradation Model Validation

  • Authors:
  • Tapas Kanungo; Robert M. Haralick; Werner Stuetzle; Henry S. Baird; David Madigan

  • Affiliations:
  • Univ. of Maryland, College Park; Univ. of Washington, Seattle; Univ. of Washington, Seattle; Xerox Palo Alto Research Center, Palo Alto, CA; Soliloquy Inc., New York, NY

  • Venue:
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Year:
  • 2000

Quantified Score

Hi-index 0.15

Abstract

Printing, photocopying, and scanning processes degrade the image quality of a document. Statistical models of these degradation processes are crucial for document image understanding research. Models allow us to predict system performance, conduct controlled experiments to study the breakdown points of the systems, create large multilingual data sets with groundtruth for training classifiers, design optimal noise removal algorithms, choose values for the free parameters of the algorithms, and so on. Although research in document understanding started many decades ago, only two document degradation models have been proposed thus far. Furthermore, no attempts have been made to statistically validate these models. In this paper, we present a statistical methodology that can be used to validate local degradation models. This method is based on a nonparametric, two-sample permutation test. Another standard statistical device, the power function, is then used to choose between algorithm variables such as distance functions. Since the validation and the power function procedures are independent of the model, they can be used to validate any other degradation model. A method for comparing any two models is also described. It uses p-values associated with the estimated models to select the model that is closer to the real world.
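
As a rough illustration of the two statistical devices named in the abstract, the following is a minimal sketch of a two-sample permutation test and a Monte Carlo estimate of its power. It is not the authors' implementation. Everything specific here is an assumption made for illustration: samples are plain lists of scalar features rather than degraded character images, the distance `mean_distance` is a hypothetical stand-in for the distance functions the paper compares, and the Gaussian "real" and "synthetic" generators in `estimate_power` are toy substitutes for a real scanner and a degradation model.

```python
import random


def mean_distance(xs, ys):
    """Hypothetical distance between two samples: absolute difference of means."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))


def permutation_p_value(real, synthetic, distance=mean_distance,
                        n_perm=500, rng=None):
    """Estimate the p-value of a two-sample permutation test.

    Null hypothesis: both samples come from the same distribution, so any
    relabeling of the pooled data is equally likely.
    """
    if rng is None:
        rng = random.Random(0)
    observed = distance(real, synthetic)
    pooled = list(real) + list(synthetic)
    n = len(real)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled sample
        if distance(pooled[:n], pooled[n:]) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)


def estimate_power(shift, distance=mean_distance, n=10, trials=20,
                   alpha=0.05, seed=0):
    """Monte Carlo estimate of the rejection rate at a given model mismatch.

    Assumption: 'real' data is N(0, 1) and the 'model' generates N(shift, 1).
    Evaluating this over a range of shifts traces out a power function.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        real = [rng.gauss(0.0, 1.0) for _ in range(n)]
        synthetic = [rng.gauss(shift, 1.0) for _ in range(n)]
        p = permutation_p_value(real, synthetic, distance, n_perm=200, rng=rng)
        if p < alpha:
            rejections += 1
    return rejections / trials
```

Under these assumptions, a distance function whose estimated power rises more steeply as `shift` moves away from zero would be the preferable choice, and two candidate models could be compared by running `permutation_p_value` against the same real sample and favoring the one with the larger p-value.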