This paper evaluates the degradation in retrieval effectiveness when searching a noisy text corpus. We start from a clean version of the collection and compare it with two degraded versions, one with a recognition error rate of about 5% and the other 20%. In our experiments we evaluate six IR models based on three text representations (word-based, n-gram, trunc-n) as well as three stemming strategies. Using mean reciprocal rank as the performance measure, we show that retrieval effectiveness degrades by around 17% at an error rate of 5%, and that raising the error rate to 20% leads to an average decrease of around 46%. Representing text by means of 4-grams tends to yield better retrieval quality when searching noisy texts. Finally, we cannot draw any clear conclusions regarding the impact of different stemming strategies or the use of blind query expansion.
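To make the terms in the abstract concrete, the following is a minimal Python sketch of the n-gram and trunc-n representations and of mean reciprocal rank. It is an illustration under stated assumptions, not the authors' implementation: the function names (`char_ngrams`, `trunc_n`, `mean_reciprocal_rank`, `relative_change`) are hypothetical, n-grams are assumed to be overlapping and generated within single words, and the toy inputs are invented for the example.

```python
def char_ngrams(word, n=4):
    """Overlapping character n-grams of one word (assumed within-word only).

    Words no longer than n are kept whole, which is an assumption here.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def trunc_n(word, n=4):
    """trunc-n representation: keep only the first n characters of the word."""
    return word[:n]


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average over queries of 1/rank of the first relevant document.

    A query with no relevant document in its ranking contributes 0.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def relative_change(clean_score, noisy_score):
    """Relative change of the noisy score with respect to the clean score."""
    return (noisy_score - clean_score) / clean_score


# Toy data, invented for illustration only.
grams = char_ngrams("retrieval", 4)
prefix = trunc_n("retrieval", 4)
mrr = mean_reciprocal_rank(
    [["d1", "d2"], ["d3", "d4", "d5"]],
    [{"d2"}, {"d3"}],
)
```

One intuition for why n-grams may help with noisy text: a single misrecognized character corrupts only the n-grams that overlap it, so the remaining n-grams of the word can still match the query, whereas a word-based index loses the whole term.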