A Winnow-Based Approach to Context-Sensitive Spelling Correction
Machine Learning - Special issue on natural language learning
A generative probabilistic OCR model for NLP applications
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Unsupervised profiling of OCRed historical documents
Pattern Recognition
Hi-index | 0.00 |
In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.