The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text

  • Authors:
  • Paul B. Kantor; Ellen M. Voorhees

  • Affiliations:
  • Department of Library and Information Science, Rutgers University, 4 Huntington St., New Brunswick, NJ 08901, USA; National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA

  • Venue:
  • Information Retrieval
  • Year:
  • 2000

Abstract

A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600-document corpus, whose true content was known, were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
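
To make the notion of a character error rate concrete, the following is a minimal sketch in Python. The abstract does not say how the track estimated its 5% and 20% figures, so the definition used here (Levenshtein edit distance between the true text and the OCR output, divided by the length of the true text) is an assumption, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning reference into hypothesis."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # distance from a reference prefix of length i to the empty string
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete ref_char
                curr[j - 1] + 1,                  # insert hyp_char
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this assumed definition, an OCR output such as `"retrieva1"` for the true text `"retrieval"` has one substituted character out of nine, so `character_error_rate("retrieval", "retrieva1")` is 1/9, roughly 11%.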