A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques

  • Authors:
  • Stoyan Mihov; Klaus U. Schulz; Christoph Ringlstetter; Veselka Dojchinova; Vanja Nakova

  • Affiliations:
  • IPP - Bulgarian Academy of Sciences, Sofia; CIS, University of Munich; CIS, University of Munich; IPP - Bulgarian Academy of Sciences, Sofia; IPP - Bulgarian Academy of Sciences, Sofia

  • Venue:
  • ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
  • Year:
  • 2005

Abstract

We describe a new corpus collected for the comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available to academic groups and for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents, many of which contain both Cyrillic and Latin symbols. A smaller corpus of German documents has been added. All original documents are real-life paper documents collected from enterprises and organizations. Most genres of written language and various document types are covered. The corpus contains the corresponding image files, rich metadata, text files obtained via OCR, ground truth data for hundreds of example pages, and alignment software for experiments.