A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques

Authors:
Stoyan Mihov;Klaus U. Schulz;Christoph Ringlstetter;Veselka Dojchinova;Vanja Nakova
Affiliations:
IPP - Bulgarian Academy of Sciences, Sofia;CIS, University of Munich;CIS, University of Munich;IPP - Bulgarian Academy of Sciences, Sofia;IPP - Bulgarian Academy of Sciences, Sofia
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Abstract

We describe a new corpus collected for the comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available to academic groups for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents, many of which contain both Cyrillic and Latin characters. A smaller corpus of German documents has been added. All source documents are real-life paper documents collected from enterprises and organizations. Most genres of written language and various document types are covered. The corpus contains the corresponding image files, rich metadata, text files obtained via OCR, ground truth data for hundreds of example pages, and alignment software for experiments.