Techniques for automatically correcting words in text
ACM Computing Surveys (CSUR)
Optical Character Recognition: An Illustrated Guide to the Frontier
A Tutorial on Support Vector Machines for Pattern Recognition
Data Mining and Knowledge Discovery
Adaptive text correction with Web-crawled domain-dependent dictionaries
ACM Transactions on Speech and Language Processing (TSLP)
Google Book Search: Document Understanding on a Massive Scale
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
On lexical resources for digitization of historical documents
Proceedings of the 9th ACM symposium on Document engineering
Efficiently generating correction suggestions for garbled tokens of historical language
Natural Language Engineering
International Journal on Document Analysis and Recognition - Special issue on noisy text analytics
Erroneous tokens in the output of an OCR engine can be roughly divided into two categories. For less severe OCR errors, human readers - and in many cases text correction systems as well - are able to reconstruct the correct original word or to suggest a small set of plausible corrections. Sometimes, however, the OCR output contains "garbage" tokens for which it is impossible to predict the correct word. Garbage tokens are caused, for example, by graphics in the scanned images that the OCR engine misinterprets as text. In this paper we report on the development of a classifier for garbage tokens in OCR output on historical documents. The classifier is based on a specific feature set and is implemented as a support vector machine. In our experiments it clearly outperformed simple rule-based predecessor solutions for OCR garbage detection.
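The abstract's approach - surface features of a token fed into an SVM that separates garbage from ordinary words - can be sketched as follows. This is a minimal illustration, not the paper's method: the feature set (token length, alphabetic share, vowel ratio, symbol density) and the toy training data are assumptions for demonstration only.

```python
# Sketch of an SVM-based garbage-token classifier, with an assumed
# (illustrative) feature set; the paper's actual features are not shown here.
import re
from sklearn.svm import SVC

def token_features(token):
    """Map a token to simple surface features (illustrative, not the paper's)."""
    n = len(token)
    alpha = sum(c.isalpha() for c in token)
    vowels = sum(c.lower() in "aeiou" for c in token)
    return [
        n,                                        # token length
        alpha / n,                                # share of alphabetic characters
        vowels / max(alpha, 1),                   # vowel ratio among letters
        len(re.findall(r"[^\w\s]", token)) / n,   # punctuation/symbol density
    ]

# Tiny hand-made training set standing in for labeled OCR output.
GARBAGE = ["@#$%&", "~~|||~~", "%%%%", "*&^#@!", ")(*&^"]
NORMAL = ["the", "historical", "document", "correction", "token"]

X = [token_features(t) for t in GARBAGE + NORMAL]
y = [1] * len(GARBAGE) + [0] * len(NORMAL)

clf = SVC(kernel="linear")
clf.fit(X, y)

def is_garbage(token):
    """Predict whether an OCR token is garbage (True) or a plausible word."""
    return bool(clf.predict([token_features(token)])[0])
```

In practice the classifier would be trained on tokens labeled against the historical ground truth, and the feature set would be tuned to the OCR engine's typical failure modes.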