A search engine for historical manuscript images

Authors:
Toni M. Rath;R. Manmatha;Victor Lavrenko
Affiliations:
University of Massachusetts, Amherst, MA;University of Massachusetts, Amherst, MA;University of Massachusetts, Amherst, MA
Venue:
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2004

Abstract

Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to make them publicly accessible. These collections are available only as images, so providing access to them requires expensive manual annotation. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used on such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can then be used to retrieve unlabeled images of handwritten documents given a text query. We report experiments with a training set of 100 transcribed pages and a test set of 987 handwritten page images from the George Washington collection. Precision at 20 documents is about 0.4 to 0.5, depending on the model. To the best of our knowledge, this is the first automatic retrieval system for historical manuscripts that uses text queries without manual transcription of the original corpus.
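
The general idea of learning a joint distribution from transcribed word images and then ranking unlabeled pages for a text query can be illustrated with a toy sketch. This is not either of the two models in the paper, whose details the abstract does not give. It assumes each word image has already been reduced to a single discrete feature token, estimates P(word | feature) from smoothed co-occurrence counts, and scores a page by the average probability that its word images match the query. All names (`train_joint`, `rank_pages`, `precision_at_k`, the `alpha` smoothing parameter) are hypothetical.

```python
from collections import Counter


def train_joint(pairs, alpha=0.1):
    """Estimate P(word | feature) from (feature_token, word) training pairs.

    Assumption: each transcribed word image is represented by one discrete
    feature token. Additive smoothing (alpha) avoids zero probabilities.
    """
    joint = Counter(pairs)                      # counts of (feature, word)
    feat_totals = Counter(f for f, _ in pairs)  # counts of each feature
    vocab_size = len({w for _, w in pairs})

    def p_word_given_feature(word, feat):
        num = joint[(feat, word)] + alpha
        den = feat_totals[feat] + alpha * vocab_size
        return num / den

    return p_word_given_feature


def rank_pages(pages, query, p_word_given_feature):
    """Rank unlabeled pages for a one-word text query.

    pages maps a page id to the feature tokens of its word images.
    A page's score is the mean of P(query | feature) over its word images.
    """
    scores = {}
    for page_id, feats in pages.items():
        if feats:  # skip pages with no segmented word images
            total = sum(p_word_given_feature(query, f) for f in feats)
            scores[page_id] = total / len(feats)
    return sorted(scores, key=scores.get, reverse=True)


def precision_at_k(ranked, relevant, k=20):
    """Fraction of the top-k ranked pages that are relevant."""
    return sum(1 for page_id in ranked[:k] if page_id in relevant) / k


# Tiny illustrative data set (invented for this sketch).
training = [("f1", "the"), ("f1", "the"), ("f2", "army"),
            ("f2", "army"), ("f1", "army")]
model = train_joint(training)
pages = {"p1": ["f2", "f2"], "p2": ["f1", "f1"], "p3": ["f1", "f2"]}
ranking = rank_pages(pages, "army", model)
```

With the experimental setup described in the abstract, `precision_at_k` would be evaluated at k = 20 over the ranked test pages; the toy data here is far too small for that and only shows how the pieces fit together.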