A comprehensive evaluation methodology for noisy historical document recognition techniques

  • Authors:
  • Nikolaos Stamatopoulos; Georgios Louloudis; Basilis Gatos

  • Affiliations:
  • Institute of Informatics and Telecommunications, NCSR "Demokritos", Agia Paraskevi, Athens, Greece (all authors); University of Athens, Greece (Georgios Louloudis)

  • Venue:
  • Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
  • Year:
  • 2009

Abstract

In this paper, we propose a new comprehensive methodology for evaluating the performance of noisy historical document recognition techniques. We aim to evaluate not only the final noisy recognition result but also the main intermediate stages of text line, word and character segmentation. For this purpose, we efficiently create the text line, word and character segmentation ground truth, guided by the transcription of the historical documents. The proposed methodology consists of (i) a semi-automatic procedure that detects the text line, word and character segmentation ground truth regions using the correct document transcription, and (ii) the calculation of appropriate evaluation metrics that measure the performance of the final OCR result as well as of the intermediate segmentation stages. The semi-automatic procedure for detecting the ground truth regions has been evaluated and found to be efficient and time-saving. Experimental results show that, with the proposed technique, more than 90% of the time needed to create the text line, word and character segmentation ground truth is saved. A detailed experiment using a commercial OCR engine applied to a historical book is also presented.
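The abstract does not state which metrics are computed, so the following is only an illustrative sketch of the kind of measures step (ii) could involve, not the authors' definitions. It assumes that segmentation regions are axis-aligned bounding boxes, that a detected region counts as correct when its intersection-over-union with a ground truth region reaches a chosen threshold, and that OCR quality is measured as character accuracy from the edit distance to the transcription. All function names, the box representation and the threshold value are hypothetical.

```python
from typing import List, Tuple

# Hypothetical region type: (x0, y0, x1, y1) with exclusive right/bottom edges.
Box = Tuple[int, int, int, int]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def segmentation_scores(truth: List[Box], detected: List[Box],
                        threshold: float = 0.9) -> Tuple[float, float, float]:
    """Return (detection rate, recognition accuracy, F-measure).

    Each ground truth region is greedily matched to the best unused
    detected region whose IoU is at least `threshold` (an assumed value).
    """
    used = set()
    matches = 0
    for t in truth:
        best_j, best_score = -1, threshold
        for j, d in enumerate(detected):
            score = iou(t, d)
            if j not in used and score >= best_score:
                best_j, best_score = j, score
        if best_j >= 0:
            used.add(best_j)
            matches += 1
    dr = matches / len(truth) if truth else 0.0        # share of truth regions found
    ra = matches / len(detected) if detected else 0.0  # share of detections that are correct
    fm = 2 * dr * ra / (dr + ra) if dr + ra else 0.0
    return dr, ra, fm


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_accuracy(transcription: str, ocr_text: str) -> float:
    """1 - (edit distance / transcription length), clipped at 0."""
    if not transcription:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(transcription, ocr_text) / len(transcription))
```

The same `segmentation_scores` function could be applied separately to text line, word and character regions, so that each intermediate stage gets its own score alongside the final character accuracy.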