A Complete Optical Character Recognition Methodology for Historical Documents

Authors:
G. Vamvakas; B. Gatos; N. Stamatopoulos; S. J. Perantonis
Venue:
DAS '08 Proceedings of the 2008 The Eighth IAPR International Workshop on Document Analysis Systems
Year:
2008

Abstract

This paper presents a complete OCR methodology for recognizing historical documents, either printed or handwritten, without any knowledge of the font. The methodology consists of three steps: the first two create a training database from a set of documents, while the third recognizes new document images. First, a pre-processing step performs image binarization and enhancement. Second, a top-down segmentation approach detects text lines, words and characters. A clustering scheme is then adopted to group characters of similar shape. This procedure is semi-automatic, since the user can intervene at any time to correct clustering errors and assign an ASCII label. After this step, a database is created for use in recognition. Finally, in the third step, every new document image is segmented with the same approach, and recognition is based on the character database produced in the previous step.
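The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the binarization, segmentation, or clustering algorithms, so the sketch assumes a fixed global threshold, projection-profile segmentation, a pixel-mismatch distance, and greedy clustering. All function names are hypothetical, and the word-level segmentation and image enhancement stages are omitted for brevity.

```python
from typing import List, Tuple

Image = List[List[int]]  # row-major grid: grayscale 0-255 or binary 0/1


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Assumed global threshold: pixels darker than `threshold` become ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open [start, end) spans where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment(binary: Image) -> List[List[Image]]:
    """Top-down split: text lines by row profile, then glyphs by column profile."""
    lines = []
    for top, bottom in runs([sum(row) for row in binary]):
        band = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        lines.append([[row[l:r] for row in band] for l, r in runs(col_profile)])
    return lines


def distance(a: Image, b: Image) -> int:
    """Count mismatching pixels after zero-padding both glyphs to a common size."""
    h = max(len(a), len(b))
    w = max(len(a[0]), len(b[0]))

    def at(g: Image, y: int, x: int) -> int:
        return g[y][x] if y < len(g) and x < len(g[0]) else 0

    return sum(at(a, y, x) != at(b, y, x) for y in range(h) for x in range(w))


def cluster(glyphs: List[Image], max_dist: int = 1) -> List[List[Image]]:
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters: List[List[Image]] = []
    for g in glyphs:
        for c in clusters:
            if distance(c[0], g) <= max_dist:
                c.append(g)
                break
        else:
            clusters.append([g])
    return clusters


def recognize(glyph: Image, database: List[Tuple[Image, str]]) -> str:
    """Return the label of the nearest (prototype, label) entry in the database."""
    return min(database, key=lambda entry: distance(glyph, entry[0]))[1]
```

In this sketch the semi-automatic step corresponds to a user inspecting the output of `cluster`, merging or splitting groups where needed, and pairing one prototype per group with a label to build the `database` list that `recognize` consults for new pages.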