Aligning transcripts to automatically segmented handwritten manuscripts

Authors:
Jamie Rothfeder; R. Manmatha; Toni M. Rath
Affiliations:
Department of Computer Science, University of Massachusetts Amherst, Amherst, MA (all three authors)
Venue:
DAS '06: Proceedings of the 7th International Conference on Document Analysis Systems
Year:
2006

Abstract

Training and evaluating techniques for handwriting recognition and retrieval is challenging because large ground-truthed datasets are difficult to create. This is especially true for historical handwritten datasets. In many instances the ground truth has to be created by manually transcribing each word, which is a very labor-intensive process. Transcriptions are sometimes available for some manuscripts, but they were created for other purposes, so correspondence at the word, line, or sentence level may not be available. To be useful for training and evaluation, a word-level correspondence must be available between the segmented handwritten word images and the ASCII transcriptions. Creating this correspondence, or alignment, is challenging because the automatic segmentation is often error-prone and the ASCII transcription may also contain errors. Very little work has been done on the alignment of handwritten data to transcripts. Here, a novel Hidden Markov Model based automatic alignment algorithm is described and tested. The algorithm achieves an average alignment accuracy of about 72.8% when aligning whole pages at a time on a set of 70 pages of the George Washington collection. This is about 12% better than a dynamic time warping alignment algorithm previously reported in the literature and tested on the same collection.
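
The abstract does not specify the model, so the following is only a minimal sketch of how an HMM-style alignment of this kind could work, not the authors' algorithm. It assumes that transcript words are the hidden states, that segmented word images are the observations, and that each image is summarized by a single hypothetical feature (its pixel width, compared against the character count of the word). The names `viterbi_align` and `alignment_accuracy`, the parameters `px_per_char` and `sigma`, the transition probabilities, and the accuracy definition (fraction of word images assigned to the correct transcript word) are all illustrative assumptions.

```python
import math


def viterbi_align(widths, transcript, px_per_char=20.0, sigma=15.0):
    """Assign each word image (given by its width) to a transcript word index.

    Hypothetical model: states are transcript positions, observations are
    image widths, and decoding is standard Viterbi in log space.
    """
    # Assumed transition log-probabilities, keyed by how far the state moves:
    # 0 = stay (one word was split into several images),
    # 1 = advance to the next word (correct segmentation),
    # 2 = skip a word (two words were merged into one image).
    log_trans = {0: math.log(0.1), 1: math.log(0.8), 2: math.log(0.1)}

    def log_emit(width, word):
        # Gaussian log-likelihood (up to a constant) of the observed width,
        # centered on the width expected from the word's character count.
        expected = px_per_char * len(word)
        return -0.5 * ((width - expected) / sigma) ** 2

    n_obs, n_states = len(widths), len(transcript)
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_obs)]
    back = [[0] * n_states for _ in range(n_obs)]

    # Assumption: the first image belongs to the first transcript word.
    score[0][0] = log_emit(widths[0], transcript[0])

    for t in range(1, n_obs):
        for s in range(n_states):
            for jump, lp in log_trans.items():
                prev = s - jump
                if prev < 0 or score[t - 1][prev] == neg_inf:
                    continue
                cand = score[t - 1][prev] + lp + log_emit(widths[t], transcript[s])
                if cand > score[t][s]:
                    score[t][s] = cand
                    back[t][s] = prev

    # Backtrace from the best-scoring final state.
    state = max(range(n_states), key=lambda s: score[-1][s])
    path = [state]
    for t in range(n_obs - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return path


def alignment_accuracy(predicted, truth):
    """Fraction of word images mapped to the correct transcript index."""
    correct = sum(p == g for p, g in zip(predicted, truth))
    return correct / len(truth)


# Made-up example: five cleanly segmented word images and their transcript.
transcript = ["the", "general", "orders", "of", "today"]
widths = [62, 135, 118, 41, 97]
path = viterbi_align(widths, transcript)
print(path)  # -> [0, 1, 2, 3, 4]
print(alignment_accuracy(path, [0, 1, 2, 3, 4]))  # -> 1.0
```

The stay and skip transitions are what let such a model tolerate segmentation errors. A single width feature is far too weak for real manuscripts, and a practical system would likely use richer image features and transition and emission parameters estimated from data.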