A Complete Optical Character Recognition Methodology for Historical Documents

  • Authors:
  • G. Vamvakas, B. Gatos, N. Stamatopoulos, S. J. Perantonis

  • Venue:
  • DAS '08: Proceedings of the 2008 Eighth IAPR International Workshop on Document Analysis Systems
  • Year:
  • 2008

Abstract

This paper presents a complete OCR methodology for recognizing historical documents, either printed or handwritten, without any prior knowledge of the font. The methodology consists of three steps: the first two create a training database from a set of documents, while the third recognizes new document images. The first step is pre-processing, which includes image binarization and enhancement. In the second step, a top-down segmentation approach detects text lines, words and characters, and a clustering scheme then groups characters of similar shape. This is a semi-automatic procedure, since the user can intervene at any time to correct clustering errors and assign an ASCII label. The result of this step is a character database used for recognition. Finally, in the third step, every new document image undergoes the same segmentation, and recognition is based on the character database produced in the previous step.
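
The three steps can be illustrated with a minimal sketch. The abstract does not name the binarization method, the shape features, or the clustering algorithm, so everything below is an assumption chosen for brevity: a global threshold stands in for binarization, projection profiles stand in for the top-down segmentation (word detection and enhancement are skipped), and a greedy leader clustering over Hamming distance stands in for the clustering scheme. All function names (`binarize`, `runs`, `segment_characters`, `normalize`, `hamming`, `cluster`, `recognize`) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixel values
Feature = Tuple[int, ...]        # flattened fixed-size binary glyph


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Global threshold (assumed): 1 marks ink, 0 marks background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Half-open (start, end) ranges where the projection profile is non-zero."""
    out, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(profile)))
    return out


def segment_characters(binary: Image) -> List[Image]:
    """Top-down split: rows into text lines, then columns into glyphs."""
    glyphs = []
    for top, bottom in runs([sum(row) for row in binary]):
        line = binary[top:bottom]
        column_profile = [sum(col) for col in zip(*line)]
        for left, right in runs(column_profile):
            glyphs.append([row[left:right] for row in line])
    return glyphs


def normalize(glyph: Image, size: int = 8) -> Feature:
    """Nearest-neighbour resample to size x size so glyphs are comparable."""
    h, w = len(glyph), len(glyph[0])
    return tuple(glyph[r * h // size][c * w // size]
                 for r in range(size) for c in range(size))


def hamming(a: Feature, b: Feature) -> int:
    """Number of positions where two equal-length features differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster(features: List[Feature], max_dist: int):
    """Greedy leader clustering: join the first leader within max_dist."""
    leaders, assignment = [], []
    for feature in features:
        for index, leader in enumerate(leaders):
            if hamming(feature, leader) <= max_dist:
                assignment.append(index)
                break
        else:
            leaders.append(feature)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


def recognize(feature: Feature, database: List[Tuple[Feature, str]]) -> str:
    """Return the label of the nearest labelled cluster leader."""
    return min(database, key=lambda entry: hamming(feature, entry[0]))[1]


# Toy page: one text line with the shapes L, T, L and a second line with T.
page = [
    ".............",
    ".#...###.#...",
    ".#....#..#...",
    ".###..#..###.",
    ".............",
    ".###.........",
    "..#..........",
    "..#..........",
    ".............",
]
gray = [[0 if ch == "#" else 255 for ch in line] for line in page]
features = [normalize(g) for g in segment_characters(binarize(gray))]
leaders, assignment = cluster(features, max_dist=4)   # [0, 1, 0, 1]
# The user-supplied labels play the role of the semi-automatic labelling step.
database = list(zip(leaders, ["L", "T"]))
```

In this toy run the labels passed to `zip` stand in for the user's interactive correction and ASCII labelling, and `recognize(features[2], database)` then looks up a newly segmented glyph against the stored cluster leaders. A real system would need features and a clustering scheme that are far more tolerant of noise and shape variation than exact bitmap comparison.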