A Comparison of Text-Based Methods for Detecting Duplication in Scanned Document Databases

  • Authors: Daniel P. Lopresti
  • Affiliations: Bell Labs, Lucent Technologies Inc., 600 Mountain Avenue, Room 2D-447, Murray Hill, NJ 07974, USA. dpl@research.bell-labs.com
  • Venue: Information Retrieval
  • Year: 2001

Abstract

This paper presents an experimental evaluation of several text-based methods for detecting duplication in scanned document databases using uncorrected OCR output. This task is made challenging both by the wide range of degradations printed documents can suffer, and by conflicting interpretations of what it means to be a “duplicate.” We report results for four sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.