Our current research aims to build a filter-based post-OCR accuracy-boosting system that combines different post-OCR correction filters to improve OCR accuracy beyond what any individual filter achieves. In this paper we focus on a Hidden Markov Model (HMM) based accuracy booster that models OCR engine noise generation as a two-layer stochastic process. We employ a commercial spell-checker both as another error-correction filter and as a baseline for comparing accuracy gains. We demonstrate the versatility of our approach in experiments with documents in English and Arabic.
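To make the two-layer idea concrete, the following is a minimal sketch and not the system described above. It assumes one plausible reading of the model: the hidden layer is the true character sequence, generated by a character bigram chain, and the observed layer is the OCR output, generated by a per-character confusion table. Viterbi decoding then recovers the most likely true string. The toy alphabet, the training words, the smoothing constant, the confusion probabilities, and the names `train_language_model`, `build_noise_model`, and `viterbi` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "aclt"   # toy alphabet (assumption for illustration)
SMOOTH = 0.1        # additive smoothing constant (assumption)

def normalize_log(counts):
    """Turn smoothed counts over ALPHABET into log-probabilities."""
    total = sum(counts[s] + SMOOTH for s in ALPHABET)
    return {s: math.log((counts[s] + SMOOTH) / total) for s in ALPHABET}

def train_language_model(words):
    """Hidden layer: character bigram chain estimated from clean words."""
    start = Counter(w[0] for w in words)
    trans = defaultdict(Counter)
    for w in words:
        for prev, nxt in zip(w, w[1:]):
            trans[prev][nxt] += 1
    log_start = normalize_log(start)
    log_trans = {p: normalize_log(trans[p]) for p in ALPHABET}
    return log_start, log_trans

def build_noise_model(confusions, p_correct=0.85):
    """Observed layer: P(OCR char | true char), with optional overrides."""
    p_other = (1 - p_correct) / (len(ALPHABET) - 1)
    log_emit = {}
    for true in ALPHABET:
        row = confusions.get(
            true,
            {o: (p_correct if o == true else p_other) for o in ALPHABET},
        )
        log_emit[true] = {o: math.log(row[o]) for o in ALPHABET}
    return log_emit

def viterbi(observed, states, log_start, log_trans, log_emit):
    """Return the most likely hidden character string for `observed`."""
    if not observed:
        return ""
    # best[-1][s]: log-probability of the best path ending in state s
    best = [{s: log_start[s] + log_emit[s][observed[0]] for s in states}]
    back = []
    for ch in observed[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][ch]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

# Toy setup: the OCR engine is assumed to misread "t" as "l" 30% of the time.
log_start, log_trans = train_language_model(["cat", "cat", "act", "tact"])
log_emit = build_noise_model({"t": {"t": 0.6, "l": 0.3, "a": 0.05, "c": 0.05}})

print(viterbi("cal", ALPHABET, log_start, log_trans, log_emit))  # prints "cat"
```

With these toy parameters the decoder prefers "cat" over the literal reading "cal" because the bigram chain makes "a" followed by "l" unlikely while the confusion table makes "t" read as "l" fairly likely. A realistic booster would estimate both layers from aligned OCR output and ground-truth text rather than from hand-set values.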