Information Retrieval can Cope with Many Errors

Authors:
Elke Mittendorf;Peter Schäuble
Affiliations:
Systor AG, CH-8048 Zürich, Switzerland. elke.mittendorf@systor.com;Eurospider Information Technology AG, CH-8006 Zürich, Switzerland. schauble@eurospider.ch
Venue:
Information Retrieval
Year:
2000

Abstract

The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been the subject of research for several years. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is important, however, that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results indicate that expensive manual or automatic post-processing of OCR-converted documents is usually not worthwhile, but that scanning and OCR must be performed appropriately and with care.
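
The abstract does not spell out the random experiment itself, so the following is only an illustrative sketch and not the authors' model. It assumes, hypothetically, that each occurrence of a query term is misrecognized independently with the same probability `error_rate`, and computes the chance that at least one occurrence survives OCR intact. The function name and parameters are assumptions introduced for this example. Under that assumption, the sketch shows why document length matters: the more often a term occurs, the less likely it is to be lost entirely.

```python
def survival_probability(error_rate: float, occurrences: int) -> float:
    """Chance that at least one occurrence of a term is recognized correctly.

    Assumption (not from the paper): every occurrence is corrupted
    independently with probability `error_rate`.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    # All occurrences are lost with probability error_rate ** occurrences.
    return 1.0 - error_rate ** occurrences


if __name__ == "__main__":
    # Compare a term that occurs once with one that occurs several times.
    for n in (1, 2, 5, 10):
        print(n, survival_probability(0.3, n))
```

With a per-occurrence error rate of 0.3, a term that appears once survives with probability 0.7, while a term that appears twice survives with probability 0.91. Real OCR errors are unlikely to be independent, which is consistent with the abstract's caveat that errors must be distributed appropriately among words and documents.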