Information retrieval strategies for digitized handwritten medieval documents

Authors:
Nada Naji; Jacques Savoy
Affiliations:
Computer Science Department, University of Neuchatel, Neuchatel, Switzerland; Computer Science Department, University of Neuchatel, Neuchatel, Switzerland
Venue:
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Year:
2011

Abstract

This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Having access to the internal representation during the recognition stage, we were able to produce four automatic transcriptions, each introducing some form of spelling correction in an attempt to improve retrieval effectiveness. We evaluated the retrieval effectiveness of each of these versions using three text representations combined with five IR models, three stemming strategies and two query formulations. We employed a manually transcribed, error-free version to define the ground truth. Based on our experiments, we conclude that neither keeping only the single best recognized word nor keeping all of the top-k recognition alternatives provides the best performance. Instead, selecting every candidate word whose log-likelihood is close to that of the best alternative yields the best text surrogate. Within this representation, different retrieval strategies tend to produce similar performance levels.
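
The three ways of building a text surrogate that the abstract compares (single best word, top-k alternatives, and all alternatives whose log-likelihood is close to the best one) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the candidate words, the scores, and the values of k and the margin are all hypothetical, and the abstract does not say how "close" is measured, so a fixed additive margin on the log-likelihood is assumed here.

```python
from typing import List, Tuple

# One recognition candidate: (word, log-likelihood). Higher is better.
Candidate = Tuple[str, float]


def ranked(candidates: List[Candidate]) -> List[Candidate]:
    """Sort candidates from most to least likely."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)


def best_only(candidates: List[Candidate]) -> List[str]:
    """Keep only the single most likely word."""
    return [ranked(candidates)[0][0]]


def top_k(candidates: List[Candidate], k: int) -> List[str]:
    """Keep the k most likely words, whatever their scores."""
    return [word for word, _ in ranked(candidates)[:k]]


def within_margin(candidates: List[Candidate], margin: float) -> List[str]:
    """Keep every word whose log-likelihood is within `margin` of the best.

    The additive margin is an assumption made for this sketch.
    """
    ordered = ranked(candidates)
    best_score = ordered[0][1]
    return [word for word, score in ordered if best_score - score <= margin]


# Invented candidates for one word position in a manuscript line.
candidates = [("kunic", -1.0), ("kunec", -1.2), ("kunst", -4.5)]

print(best_only(candidates))
print(top_k(candidates, k=3))
print(within_margin(candidates, margin=0.5))
```

With these invented scores, the margin-based selection keeps the two near-tied spellings and drops the unlikely third one, whereas top-k with k=3 would index all three and the single-best strategy would lose the second plausible spelling.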