Information Retrieval can Cope with Many Errors

  • Authors:
  • Elke Mittendorf; Peter Schäuble

  • Affiliations:
  • Systor AG, CH-8048 Zürich, Switzerland. elke.mittendorf@systor.com; Eurospider Information Technology AG, CH-8006 Zürich, Switzerland. schauble@eurospider.ch

  • Venue:
  • Information Retrieval
  • Year:
  • 2000

Abstract

Retrieving documents that originate from digitized, OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been a subject of research for several years. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is important, however, that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results indicate that expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed appropriately and with care.
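
The idea of treating OCR conversion as a random experiment can be illustrated with a small simulation. The sketch below is not the authors' model: it assumes, purely for illustration, that every token is misrecognised independently with a fixed probability, and it uses a length-normalised term frequency as a stand-in for a real retrieval weighting scheme. All names (ocr_corrupt, score, rank, make_doc, demo) and the synthetic documents are hypothetical.

```python
import random
from collections import Counter


def ocr_corrupt(tokens, error_rate, rng):
    # Assumed error model: each token is misrecognised independently with
    # probability error_rate and becomes a string that matches no query term.
    return [tok + "#" if rng.random() < error_rate else tok for tok in tokens]


def score(query_terms, tokens):
    # Length-normalised term frequency; a deliberately simple stand-in
    # for a real retrieval weighting scheme.
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[t] for t in query_terms) / len(tokens)


def rank(query_terms, docs):
    # Document names ordered by descending score.
    return sorted(docs, key=lambda name: score(query_terms, docs[name]), reverse=True)


def make_doc(n_hits, length):
    # Synthetic document: n_hits copies of the query term plus filler tokens.
    return ["retrieval"] * n_hits + ["filler"] * (length - n_hits)


def demo(error_rate=0.3, length=1000, seed=0):
    # Rank three synthetic documents before and after simulated OCR noise.
    rng = random.Random(seed)
    clean = {
        "relevant": make_doc(length // 5, length),
        "marginal": make_doc(length // 50, length),
        "irrelevant": make_doc(0, length),
    }
    noisy = {name: ocr_corrupt(toks, error_rate, rng) for name, toks in clean.items()}
    query = ["retrieval"]
    return rank(query, clean), rank(query, noisy)


if __name__ == "__main__":
    clean_rank, noisy_rank = demo()
    print("clean:", clean_rank)
    print("noisy:", noisy_rank)
```

Under these assumptions, a 30% error rate leaves the ordering of long documents essentially unchanged, because many occurrences of the query term survive. Calling demo(length=50) gives the marginal document a single occurrence of the query term, so one recognition error can make it indistinguishable from the irrelevant one, which mirrors the abstract's caveat that documents must not be too short.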