Page frame detection for marginal noise removal from scanned documents

  • Authors:
  • Faisal Shafait;Joost Van Beusekom;Daniel Keysers;Thomas M. Breuel

  • Affiliations:
  • Image Understanding and Pattern Recognition research group, German Research Center for Artificial Intelligence (DFKI) GmbH, Kaiserslautern, Germany;Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany;Image Understanding and Pattern Recognition research group, German Research Center for Artificial Intelligence (DFKI) GmbH, Kaiserslautern, Germany;Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany

  • Venue:
  • SCIA'07 Proceedings of the 15th Scandinavian conference on Image analysis
  • Year:
  • 2007

Abstract

We describe and evaluate a method to robustly detect the page frame in document images, locating the actual page contents area and removing textual and non-textual noise along the page borders. We use a geometric matching algorithm to find the optimal page frame, which has the advantages of not assuming the existence of whitespace between noisy borders and actual page contents, and of giving a practical solution to the page frame detection problem without the need for parameter tuning. We define suitable performance measures and evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. In addition, we demonstrate that the use of page frame detection reduces the optical character recognition (OCR) error rate by removing textual noise. Experiments using a commercial OCR system show that the error rate due to elements outside the page frame is reduced from 4.3% to 1.7% on the UW-III dataset.
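The abstract does not give the details of the geometric matching step, so the sketch below covers only the final stage it describes: once a page frame is known, elements along the page borders are discarded. It is an illustration under stated assumptions, not the authors' implementation. The names `inside_frame` and `remove_marginal_noise`, the bounding-box representation, and the `min_overlap` threshold are all hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: axis-aligned box (x0, y0, x1, y1) in pixels,
# with x1 and y1 exclusive. The paper's own data structures are not given.
Box = Tuple[int, int, int, int]


def inside_frame(box: Box, frame: Box, min_overlap: float = 0.5) -> bool:
    """Return True if at least `min_overlap` of the box area lies in the frame.

    The overlap threshold is an assumption made for this sketch.
    """
    bx0, by0, bx1, by1 = box
    fx0, fy0, fx1, fy1 = frame
    area = max(0, bx1 - bx0) * max(0, by1 - by0)
    if area == 0:
        return False
    # Width and height of the intersection, clamped at zero when disjoint.
    ix = max(0, min(bx1, fx1) - max(bx0, fx0))
    iy = max(0, min(by1, fy1) - max(by0, fy0))
    return (ix * iy) / area >= min_overlap


def remove_marginal_noise(boxes: List[Box], frame: Box,
                          min_overlap: float = 0.5) -> List[Box]:
    """Keep only the component boxes that fall inside the page frame."""
    return [b for b in boxes if inside_frame(b, frame, min_overlap)]


# Made-up example values: a page frame and four component bounding boxes.
frame = (100, 120, 900, 1300)
boxes = [
    (150, 200, 400, 230),   # text line well inside the frame
    (0, 0, 60, 1400),       # dark scanning border at the left edge
    (880, 500, 960, 530),   # fragment of a neighbouring page, mostly outside
    (95, 300, 300, 330),    # text line slightly overlapping the frame edge
]
cleaned = remove_marginal_noise(boxes, frame)
```

Filtering by overlap fraction rather than strict containment lets a text line that slightly crosses the frame boundary survive, while border strips and fragments of a neighbouring page are dropped. Passing only the kept components to an OCR system is one way the removal of textual noise could feed into the error-rate reduction reported above.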