This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Because we had access to the recognizer's internal representation, we were able to produce four automatic transcriptions, each introducing some form of spelling correction in an attempt to improve retrieval effectiveness. We evaluated retrieval effectiveness for each of these versions using three text representations combined with five IR models, three stemming strategies, and two query formulations. A manually transcribed, error-free version served as the ground truth. Based on our experiments, we conclude that keeping only the single best recognition word, or all possible top-k recognition alternatives, does not provide the best performance. Selecting all words whose log-likelihood is close to that of the best alternative yields the best text surrogate. With this representation, the different retrieval strategies tend to produce similar performance levels.
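The selection rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the candidate-list format, and the margin `delta` (the abstract does not specify how "close to the best alternative" is quantified) are all assumptions.

```python
def select_alternatives(candidates, delta=1.0):
    """Keep every recognition alternative whose log-likelihood lies
    within `delta` of the best candidate's log-likelihood.

    candidates: list of (word, log_likelihood) pairs for one token
                position, as produced by the recognizer (assumed format).
    delta:      hypothetical margin; the abstract gives no exact value.
    """
    if not candidates:
        return []
    best = max(ll for _, ll in candidates)
    return [word for word, ll in candidates if best - ll <= delta]

# Hypothetical OCR alternatives for a single token position:
candidates = [("seruus", -0.2), ("servus", -0.4), ("serums", -3.1)]
print(select_alternatives(candidates))  # the two close alternatives survive
```

With this rule, a confidently recognized token contributes a single word to the text surrogate, while an ambiguous token contributes all of its plausible readings, which is the middle ground between the 1-best and full top-k representations compared in the paper.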