A Scale Space Approach for Automatically Segmenting Words from Historical Handwritten Documents

Authors:
R. Manmatha;Jamie L. Rothfeder
Affiliations:
IEEE Computer Society;-
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2005

Abstract

Many libraries, museums, and other organizations hold large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition and retrieval tools is to automatically segment handwritten pages into words. State-of-the-art segmentation techniques like the gap metrics algorithm have mostly been developed and tested on highly constrained documents like bank checks and postal addresses. Little work has addressed full handwritten pages, and that work has usually been tested on clean, artificial documents created for research purposes. Historical manuscript images, on the other hand, contain a great deal of noise and are much more challenging. Here, a novel scale space algorithm for automatically segmenting handwritten (historical) documents into words is described. First, the page is cleaned to remove margins. This is followed by a gray-level projection profile algorithm for finding lines in images. Each line image is then filtered with an anisotropic Laplacian at several scales. This procedure produces blobs which correspond to portions of characters at small scales and to words at larger scales. Crucial to the algorithm is scale selection, that is, finding the optimum scale at which blobs correspond to words. This is done by finding the maximum over scale of the extent or area of the blobs. This scale maximum is estimated using three different approaches. The blobs recovered at the optimum scale are then bounded with a rectangular box to recover the words. A postprocessing filtering step eliminates boxes of unusual size, which are unlikely to correspond to words. The approach is tested on a number of different data sets; on 100 sampled documents from the George Washington corpus of handwritten document images, the total error rate is 17 percent. The technique outperforms a state-of-the-art gap metrics word-segmentation algorithm on this collection.
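
The pipeline outlined in the abstract (line finding by projection profile, multi-scale filtering, scale selection by maximum blob extent, size-based postprocessing) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. As a loudly stated simplification, each text line is collapsed into a 1-D column profile and filtered with a 1-D second derivative of a Gaussian, rather than filtering the 2-D line image with an anisotropic Laplacian. All function names, thresholds, and the toy page are hypothetical, and margin removal and the three scale-estimation approaches mentioned in the abstract are not modeled.

```python
import math

def find_runs(values, threshold):
    """Return (start, end) index pairs, end exclusive, where values exceed threshold."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(values)))
    return runs

def row_profile(image):
    """Projection profile: total ink per row (assumes ink has high values)."""
    return [sum(row) for row in image]

def column_profile(image, top, bottom):
    """Total ink per column within the rows [top, bottom) of one text line."""
    return [sum(image[r][c] for r in range(top, bottom)) for c in range(len(image[0]))]

def second_derivative_kernel(sigma):
    """Sampled negated second derivative of a Gaussian, up to a positive constant."""
    radius = max(1, int(math.ceil(3 * sigma)))
    kernel = [(1 - x * x / (sigma * sigma)) * math.exp(-x * x / (2 * sigma * sigma))
              for x in range(-radius, radius + 1)]
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]  # zero mean: flat regions give no response

def blobs_at_scale(profile, sigma, eps=1e-9):
    """Blobs are runs where the filter response is positive (assumed criterion)."""
    kernel = second_derivative_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    response = []
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - radius
            if 0 <= idx < n:  # outside the line is treated as background (zero ink)
                acc += k * profile[idx]
        response.append(acc)
    return find_runs(response, eps)

def select_scale(profile, sigmas):
    """Pick the scale whose blobs have the largest total extent."""
    best = None
    for sigma in sigmas:
        blobs = blobs_at_scale(profile, sigma)
        extent = sum(end - start for start, end in blobs)
        if best is None or extent > best[1]:
            best = (sigma, extent, blobs)
    return best[0], best[2]

def filter_boxes(blobs, min_width, max_width):
    """Postprocessing: drop boxes whose width is implausible for a word."""
    return [(s, e) for s, e in blobs if min_width <= e - s <= max_width]

if __name__ == "__main__":
    # Toy page: 1 = ink, 0 = background; two text lines, each with two "words".
    word_row = [0, 1, 0, 1, 0, 1] + [0] * 8 + [1, 0, 1, 0, 1, 0, 1] + [0] * 3
    page = [[0] * 24, word_row, word_row, [0] * 24, word_row, [0] * 24]
    for top, bottom in find_runs(row_profile(page), 0):
        profile = column_profile(page, top, bottom)
        sigma, blobs = select_scale(profile, [1.0, 2.0, 4.0])
        print(top, bottom, sigma, filter_boxes(blobs, 2, 20))
```

In this sketch a small sigma isolates individual strokes, while larger values merge neighbouring strokes into wider blobs; picking the scale by total extent mirrors the "maximum over scale of the extent" criterion only loosely. A 2-D version would use separate horizontal and vertical scales, which is what makes the filter anisotropic.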