Comparative information retrieval evaluation for scanned documents

  • Authors:
  • Jacques Savoy; Nada Naji

  • Affiliations:
  • Computer Science Department, University of Neuchatel, Neuchâtel, Switzerland; Computer Science Department, University of Neuchatel, Neuchâtel, Switzerland

  • Venue:
  • Proceedings of the 15th WSEAS international conference on Computers
  • Year:
  • 2011

Abstract

This paper evaluates how retrieval effectiveness degrades when searching a noisy text corpus. We compare three versions of the same collection: a clean version, a second version with a recognition error rate of about 5%, and a third with an error rate of 20%. In our experiments we evaluate six IR models based on three text representations (word-based, n-gram, trunc-n) as well as three stemming strategies. Using mean reciprocal rank as the performance measure, we show that retrieval effectiveness decreases by around 17% at an error rate of 5%, and that raising the error rate to 20% brings the average decrease to around 46%. Representing text by means of 4-grams tends to give better retrieval quality when searching noisy texts. Finally, we cannot draw any clear conclusions regarding the impact of the different stemming strategies or the use of blind-query expansion.
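
The three text representations and the performance measure named in the abstract can be illustrated with a short sketch. It is not the authors' implementation. It assumes the common definitions: overlapping character n-grams taken within each word, trunc-n as the first n characters of each word, and mean reciprocal rank as the average over queries of 1/rank of the first relevant document (0 when none is retrieved). All function names, the whitespace tokenizer, and the toy data are hypothetical.

```python
def word_tokens(text):
    """Word-based representation: lowercase, whitespace-separated words."""
    return text.lower().split()


def ngram_tokens(text, n=4):
    """Overlapping character n-grams within each word (assumed definition).

    Words no longer than n characters are kept whole.
    """
    tokens = []
    for word in word_tokens(text):
        if len(word) <= n:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


def trunc_tokens(text, n=4):
    """trunc-n: keep only the first n characters of each word (assumed definition)."""
    return [word[:n] for word in word_tokens(text)]


def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the first relevant document per query.

    rankings: dict mapping query id -> ranked list of document ids
    relevant: dict mapping query id -> set of relevant document ids
    A query with no relevant document in its ranking contributes 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)


def relative_change(baseline, noisy):
    """Relative change of a score with respect to the clean baseline."""
    return (noisy - baseline) / baseline


# A toy recognition error: "retrieval" misread as "retrleval".
clean_grams = set(ngram_tokens("retrieval"))
noisy_grams = set(ngram_tokens("retrleval"))
shared = clean_grams & noisy_grams  # {"retr", "eval"}

# Toy rankings (invented data, for illustration only).
toy_rankings = {"q1": ["d3", "d1"], "q2": ["d2", "d5"]}
toy_relevant = {"q1": {"d1"}, "q2": {"d2"}}
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_relevant)  # (1/2 + 1/1) / 2 = 0.75
```

In the toy example, the misrecognized word no longer matches its clean form as a whole word, yet the two forms still share the 4-grams "retr" and "eval". This is one plausible reason, offered here only as an illustration, why an n-gram representation can be more tolerant of recognition errors. Degradation figures such as those in the abstract correspond to `relative_change` computed between the score on the clean collection and the score on a noisy one.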