Word spotting for historical documents

Authors:
Tony M. Rath;R. Manmatha
Affiliations:
University of Massachusetts Amherst, Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval, Department of Computer Science, 01003, Amherst, MA, USA;University of Massachusetts Amherst, Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval, Department of Computer Science, 01003, Amherst, MA, USA
Venue:
International Journal on Document Analysis and Recognition
Year:
2007

Abstract

Searching and indexing historical handwritten collections is a very challenging problem. We describe an approach called word spotting, which groups word images into clusters of similar words by using image matching to measure similarity. By annotating “interesting” clusters, an index that links words to the locations where they occur can be built automatically. Image similarities computed using a number of different techniques, including dynamic time warping, are compared. The word similarities are then used for clustering with both K-means and agglomerative clustering techniques. It is shown on a subset of the George Washington collection that such a word spotting technique can outperform a Hidden Markov Model word-based recognition technique in terms of word error rate.
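To illustrate the dynamic time warping matching mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes each word image has already been reduced to a sequence of per-column feature vectors; the abstract does not specify the features. The function name `dtw_distance`, the squared Euclidean local cost, and the absence of any path constraint or length normalization are all assumptions made for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of feature vectors.

    `a` and `b` are lists of equal-dimension numeric vectors, e.g. one
    vector per pixel column of a word image (an assumed representation).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")

    # cost[i][j] holds the minimal cumulative cost of aligning
    # the first i vectors of `a` with the first j vectors of `b`.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: squared Euclidean distance (an assumption).
            local = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in `a` only
                cost[i][j - 1],      # advance in `b` only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Two hypothetical one-dimensional profiles of the same word written
# at different widths: the warping absorbs the stretched middle part.
narrow = [[0.0], [1.0], [2.0]]
wide = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(narrow, wide))  # 0.0
```

Pairwise costs of this kind could serve as the dissimilarities passed to a clustering step; in practice the cost is often normalized by the warping path length so that long and short words are comparable, which this sketch omits.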