Recognizing garbage in OCR output on historical documents

Authors:
Richard Wudtke; Christoph Ringlstetter; Klaus U. Schulz
Affiliations:
CIS -- University of Munich; CIS -- University of Munich; CIS -- University of Munich
Venue:
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Year:
2011

Abstract

Erroneous tokens in the output of an OCR engine fall roughly into two categories. For less serious OCR errors, human readers (and in many cases text correction systems) can typically reconstruct the original word or suggest a small set of plausible corrections. Sometimes, however, the OCR output contains "garbage" tokens for which the correct word cannot be predicted at all. Garbage tokens arise, for example, when the OCR engine misinterprets graphics in a page image as text. In this paper we report on the development of a classifier for garbage tokens in OCR output on historical documents. The classifier is based on a specific feature set and is implemented as a support vector machine. In our experiments it clearly outperformed simple rule-based predecessor solutions for OCR garbage detection.
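
The abstract does not list the feature set or the rules of the predecessor solutions, so the following is only an illustrative sketch of the general idea: compute a few surface features per token and apply a simple rule-based garbage check of the kind the classifier is compared against. The function names, the features (symbol ratio, longest character run, vowel ratio) and all thresholds are assumptions made for this example and are not taken from the paper.

```python
# Illustrative sketch only: features and thresholds are hypothetical,
# not the feature set or rules used by the paper's authors.

VOWELS = "aeiouyäöü"  # assumed vowel inventory for the example


def token_features(token: str) -> dict:
    """Compute a few surface features of a single OCR output token."""
    n = len(token)
    if n == 0:
        return {"length": 0, "alpha_ratio": 0.0, "symbol_ratio": 0.0,
                "max_run": 0, "vowel_ratio": 0.0}

    letters = [c for c in token if c.isalpha()]
    symbols = [c for c in token if not c.isalnum()]
    vowels = [c for c in letters if c.lower() in VOWELS]

    # Longest run of identical consecutive characters.
    max_run = run = 1
    for prev, cur in zip(token, token[1:]):
        run = run + 1 if cur == prev else 1
        max_run = max(max_run, run)

    return {
        "length": n,
        "alpha_ratio": len(letters) / n,
        "symbol_ratio": len(symbols) / n,
        "max_run": max_run,
        "vowel_ratio": len(vowels) / len(letters) if letters else 0.0,
    }


def rule_based_is_garbage(token: str) -> bool:
    """Hypothetical rule-based baseline: flag tokens that look like noise."""
    f = token_features(token)
    if f["length"] == 0:
        return False
    mostly_symbols = f["symbol_ratio"] > 0.5
    long_repeat = f["max_run"] >= 4
    # Longer, mostly alphabetic tokens without any vowel are suspicious;
    # short tokens are exempt so historical spellings like "vnd" pass.
    no_vowels = (f["length"] >= 4 and f["alpha_ratio"] > 0.5
                 and f["vowel_ratio"] == 0.0)
    return mostly_symbols or long_repeat or no_vowels


if __name__ == "__main__":
    for tok in ["Kirche", "vnd", "1897", "~*;'.,|", "iiiiil"]:
        print(tok, token_features(tok), rule_based_is_garbage(tok))
```

In a learned setup like the one the abstract describes, feature vectors of this kind would be passed to a support vector machine trained on tokens labeled as garbage or non-garbage, rather than combined through hand-set thresholds.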