Error correction vs. query garbling for Arabic OCR document retrieval

  • Authors:
  • Kareem Darwish;Walid Magdy

  • Affiliations:
  • IBM Technology Development Center, Cairo, Abou Rawash, Egypt;IBM Technology Development Center, Cairo, Abou Rawash, Egypt

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Due to the existence of large numbers of legacy documents (such as old books and newspapers), improving retrieval effectiveness for OCR'ed documents continues to be an important problem. This article compares the effect of OCR error correction with and without language modeling and the effect of query garbling with weighted structured queries on the retrieval of OCR degraded Arabic documents. The results suggest that moderate error correction does not yield statistically significant improvement in retrieval effectiveness when indexing and searching using n-grams. Also, reversing error correction models to perform query garbling in conjunction with weighted structured queries yields improved retrieval effectiveness. Lastly, using very good error correction that utilizes language modeling yields the best improvement in retrieval effectiveness.