Comparative information retrieval evaluation for scanned documents

Authors:
Jacques Savoy;Nada Naji
Affiliations:
Computer Science Department, University of Neuchâtel, Neuchâtel, Switzerland;Computer Science Department, University of Neuchâtel, Neuchâtel, Switzerland
Venue:
Proceedings of the 15th WSEAS international conference on Computers
Year:
2011

Abstract

This paper evaluates the degradation in retrieval effectiveness when working with a noisy text corpus. We compare three versions of the same collection: a clean version, a second version with a recognition error rate of about 5%, and a third with an error rate of about 20%. In our experiments we evaluate six IR models based on three text representations (word-based, n-gram, trunc-n) as well as three stemming strategies. With mean reciprocal rank as the performance measure, we show that retrieval effectiveness degrades by around 17% at an error rate of 5%, and that when the error rate rises to 20%, the average decrease is around 46%. Text representation by means of 4-grams tends to give better retrieval quality when searching within noisy texts. Finally, we cannot draw any clear conclusions regarding the impact of the different stemming strategies or the use of blind query expansion.
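
As a rough illustration of the quantities named in the abstract, the following Python sketch shows one way to build trunc-n and character n-gram representations of a word and to compute mean reciprocal rank. It is not the authors' implementation. The function names are hypothetical, and two points are assumptions rather than details from the abstract: n-grams are taken as overlapping character sequences within a single word, and trunc-n keeps the first n characters of each word.

```python
from typing import List, Optional


def trunc_n(token: str, n: int = 4) -> str:
    """Keep only the first n characters of a token (assumed meaning of trunc-n)."""
    return token[:n]


def char_ngrams(token: str, n: int = 4) -> List[str]:
    """Overlapping character n-grams inside one token (assumed definition).

    Tokens no longer than n characters are kept whole.
    """
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """Average over queries of 1/rank of the first relevant item.

    Ranks start at 1; None means no relevant item was retrieved,
    which contributes 0 to the sum.
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None)
    return total / len(first_relevant_ranks)


def relative_change(baseline: float, degraded: float) -> float:
    """Relative change in percent with respect to the baseline score."""
    return (degraded - baseline) / baseline * 100.0


if __name__ == "__main__":
    clean = char_ngrams("retrieval")   # ['retr', 'etri', 'trie', 'riev', 'ieva', 'eval']
    noisy = char_ngrams("retrieva1")   # hypothetical OCR error: 'l' read as '1'
    shared = set(clean) & set(noisy)
    print(len(shared), "of", len(clean), "4-grams survive the error")

    print(trunc_n("retrieval"))                       # retr
    print(mean_reciprocal_rank([1, 2, None, 4]))      # 0.4375

    # Hypothetical MRR values, chosen only to show the arithmetic.
    print(round(relative_change(0.50, 0.415), 1))
```

In this toy example a single misrecognized character changes the whole word for a word-based index, while five of the six 4-grams still match. That is one plausible intuition for why 4-grams hold up better on noisy text, offered here as an illustration rather than as the paper's own explanation.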