Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents

Authors:
Adenike M. Lam-Adesina;Gareth J. F. Jones
Affiliations:
School of Computing and Centre for Digital Video Processing, Dublin City University, Glasnevin, Dublin, Ireland;School of Computing and Centre for Digital Video Processing, Dublin City University, Glasnevin, Dublin, Ireland
Venue:
Information Processing and Management: an International Journal
Year:
2006

Abstract

Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and, significantly, the searching of these documents. The text contents of these document images can be transcribed automatically using OCR systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition, which have previously been shown to affect document retrieval behaviour. In particular, relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data.
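
The idea of combining similar recognised strings by collection frequency and edit distance can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the merging rule, so the function names (`edit_distance`, `merge_variants`, `merged_frequencies`), the thresholds `max_dist` and `min_ratio`, and the sample frequencies below are all assumptions chosen for illustration. The sketch treats a rare term as a likely OCR variant of a much more frequent term when the two are within a small Levenshtein distance, and it uses no dictionary or training data.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def merge_variants(freqs, max_dist=2, min_ratio=5.0):
    """Map each term to a canonical term (hypothetical merging rule).

    A term is folded into an already-accepted, more frequent term when that
    term is at least `min_ratio` times as frequent and lies within
    `max_dist` edits. Both thresholds are illustrative assumptions.
    """
    # Visit terms from most to least frequent so frequent forms become heads.
    ordered = sorted(freqs, key=lambda t: (-freqs[t], t))
    canonical = {}
    heads = []
    for term in ordered:
        target = term
        for head in heads:
            frequent_enough = freqs[head] >= min_ratio * freqs[term]
            if frequent_enough and edit_distance(term, head) <= max_dist:
                target = head
                break
        canonical[term] = target
        if target == term:
            heads.append(term)
    return canonical


def merged_frequencies(freqs, canonical):
    """Sum collection frequencies over each group of merged terms."""
    merged = Counter()
    for term, count in freqs.items():
        merged[canonical[term]] += count
    return merged


# Made-up collection frequencies, including typical OCR confusions
# ("1" for "l", "c" for "e", "rn" for "m").
freqs = {
    "retrieval": 120, "retrieva1": 3, "rctrieval": 2,
    "document": 90, "docurnent": 4, "feedback": 50,
}
canonical = merge_variants(freqs)
merged = merged_frequencies(freqs, canonical)
```

In a relevance feedback setting, merged counts of this kind could be used when scoring candidate expansion terms, so that evidence spread across misrecognised variants is pooled under one form. A brute-force comparison like this is quadratic in vocabulary size, so a real system would need some form of candidate pruning.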