Word Segmentation of Handwritten Dates in Historical Documents by Combining Semantic A-Priori-Knowledge with Local Features

  • Authors:
  • Markus Feldbach; Klaus D. Tönnies

  • Venue:
  • ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
  • Year:
  • 2003

Abstract

The recognition of script in historical documents requires suitable techniques in order to identify single words. Segmentation of lines and words is a challenging task because lines are not straight and words may intersect within and between lines. For correct word segmentation, the conventional analysis of distances between text objects needs to be supplemented by a second component predicting possible word boundaries based on semantical information. For date entries, hypotheses about potential boundaries are generated based on knowledge about the different variations as to how dates are written in the documents. It is modeled by distribution curves for potential boundary locations. Word boundaries are detected by classification of local features, such as distances between adjacent text objects, together with location-based boundary distribution curves as a-priori knowledge. We applied the technique to date entries in historical church registers. Documents from the 18th and 19th century were used for training and testing. The data set consisted of 674 word boundaries in 298 date entries. Our algorithm found the correct separation under the best four hypotheses for a word sequence in 97% of all cases in the test data set.
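
The idea of combining a local gap feature with a location-based boundary distribution curve and keeping the best four hypotheses can be illustrated with a small sketch. It is not the authors' implementation: the Gaussian-shaped prior curve, the logistic gap score, the product rule used to combine them, the exhaustive enumeration of hypotheses, and all names and parameter values (`prior_curve`, `local_score`, `peaks`, `midpoint`, and so on) are assumptions made for illustration only.

```python
import math
from itertools import product


def prior_curve(x, peaks, sigma=0.08):
    """Hypothetical boundary distribution curve.

    x and peaks are positions given as fractions of the entry width.
    The value is highest near the expected boundary positions.
    """
    return max(math.exp(-0.5 * ((x - p) / sigma) ** 2) for p in peaks)


def local_score(gap, midpoint=10.0, steepness=0.5):
    """Hypothetical local feature: logistic score of a gap width in pixels."""
    return 1.0 / (1.0 + math.exp(-steepness * (gap - midpoint)))


def boundary_probability(x, gap, peaks):
    """Combine local evidence and prior knowledge as a normalised product."""
    p_local = local_score(gap)
    # Clamp the prior so that it can never veto the local evidence entirely.
    p_prior = min(max(prior_curve(x, peaks), 0.01), 0.99)
    yes = p_local * p_prior
    no = (1.0 - p_local) * (1.0 - p_prior)
    return yes / (yes + no)


def best_hypotheses(gaps, peaks, k=4):
    """Return the k most probable boundary labelings, best first.

    gaps is a list of (position, width) pairs, one per gap between
    adjacent text objects. Each hypothesis is a tuple of booleans,
    True meaning "this gap is a word boundary". Enumeration is
    exponential in the number of gaps, which is acceptable only for
    short sequences such as date entries.
    """
    probs = [boundary_probability(x, w, peaks) for x, w in gaps]
    scored = []
    for labels in product((False, True), repeat=len(gaps)):
        score = 1.0
        for is_boundary, p in zip(labels, probs):
            score *= p if is_boundary else 1.0 - p
        scored.append((score, labels))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Made-up example: four gaps in one date entry, with boundaries
# expected near 33% and 70% of the entry width.
example_gaps = [(0.15, 4.0), (0.33, 14.0), (0.50, 9.0), (0.70, 16.0)]
for score, labels in best_hypotheses(example_gaps, peaks=(0.33, 0.70)):
    print(round(score, 4), labels)
```

Returning a ranked list rather than a single answer mirrors the evaluation reported in the abstract, where a result counts as correct if the true separation is among the best four hypotheses.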