A panlingual anomalous text detector

Authors:
Ashok C. Popat
Affiliations:
Google, Inc., Mountain View, CA, USA
Venue:
Proceedings of the 9th ACM symposium on Document engineering
Year:
2009

Abstract

In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.
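To make the phrase "adaptive mixture of character-level N-gram language models" concrete, the following is a minimal sketch and not the paper's implementation. Everything in it is an assumption made for illustration: the names `CharNgramModel` and `score_segment`, the add-one smoothing, the weight floor, the per-character Bayesian weight update, and the tiny training strings. It scores a text segment by its mean bits per character under a mixture whose weights shift toward whichever model has been predicting the recent characters well; higher scores suggest more anomalous text.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character-level N-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, text, n=3):
        if n < 2:
            raise ValueError("this sketch assumes n >= 2")
        self.n = n
        padded = " " * (n - 1) + text
        positions = range(len(padded) - n + 1)
        self.ngrams = Counter(padded[i:i + n] for i in positions)
        self.contexts = Counter(padded[i:i + n - 1] for i in positions)
        # Crude smoothing: one extra slot stands in for all unseen characters.
        self.vocab_size = len(set(padded)) + 1

    def prob(self, context, char):
        """Smoothed estimate of P(char | last n-1 characters of context)."""
        context = context[-(self.n - 1):]
        return (self.ngrams[context + char] + 1) / (
            self.contexts[context] + self.vocab_size
        )


def score_segment(text, models, floor=0.01):
    """Mean bits per character under an adaptively weighted mixture of models."""
    if not text:
        return 0.0
    n = max(m.n for m in models)
    padded = " " * (n - 1) + text
    weights = [1.0 / len(models)] * len(models)
    total_bits = 0.0
    for i in range(n - 1, len(padded)):
        context, char = padded[i - (n - 1):i], padded[i]
        probs = [m.prob(context, char) for m in models]
        mix = sum(w * p for w, p in zip(weights, probs))
        total_bits -= math.log2(mix)
        # Adapt: reweight each model by how well it predicted this character,
        # keeping a floor so a model can recover in mixed-language material.
        weights = [max(w * p / mix, floor) for w, p in zip(weights, probs)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return total_bits / len(text)


# Toy training data (hypothetical); a real system would train on large corpora.
english = "the quick brown fox jumps over the lazy dog and the cat sat on the mat " * 20
german = "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte " * 20
models = [CharNgramModel(english), CharNgramModel(german)]

clean_score = score_segment("the lazy dog and the cat", models)
noisy_score = score_segment("#~|^%@#~|^%@", models)
```

In this toy setup `noisy_score` comes out well above `clean_score`, so a threshold on the score could be used to flag the noisy segment. Because the score depends only on the text, a detector built this way is independent of the OCR engine that produced it, which is one of the requirements the abstract lists.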