Optical character recognition errors and their effects on natural language processing

Authors:
Daniel Lopresti
Affiliations:
Lehigh University, Bethlehem, PA
Venue:
Proceedings of the second workshop on Analytics for noisy unstructured text data
Year:
2008

Citing (6)

Effects of OCR errors on ranking and feedback using the vector space model
Information Processing and Management: an International Journal

Prediction of OCR accuracy using simple image features
ICDAR '95 Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1)

Named entity extraction from noisy input: speech and OCR
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

A maximum entropy approach to identifying sentence boundaries
ANLC '97 Proceedings of the fifth conference on Applied natural language processing

Performance evaluation for text processing of noisy inputs
Proceedings of the 2005 ACM symposium on Applied computing

Improving information extraction by modeling errors in speech recognizer output
HLT '01 Proceedings of the first international conference on Human language technology research

Cited (7)

Tools for monitoring, visualizing, and refining collections of noisy documents
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

Text retrieval from early printed books
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

A survey of types of text noise and techniques to handle noisy text
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

Effect of OCR-errors on the transformation of semi-structured text data into relational database
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

Evaluating models of latent document semantics in the presence of OCR errors
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Document: a useful level for facing noisy data
AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data

A vector space analysis of Swedish patent claims with different linguistic indices
PaIR '10 Proceedings of the 3rd international workshop on Patent information retrieval

Abstract

Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study how the impact varies with the nature of the error and the character(s) involved. The problem of identifying tabular structures that should not be parsed as sentential text is also discussed.
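
To make the idea of classifying errors by dynamic programming concrete, the following is a minimal sketch, not the paper's method. It assumes a single-level, character-by-character alignment with unit costs, whereas the abstract describes a hierarchical approach whose details are not given here. The function name align_errors and the error labels are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def align_errors(truth: str, ocr: str) -> List[Tuple[str, str, str]]:
    """Align OCR output against ground truth with unit-cost edit distance.

    Returns one (kind, truth_char, ocr_char) tuple per edit in a
    minimum-cost alignment. This is a simplified, single-level
    illustration (an assumption), not the hierarchical method of the paper.
    """
    n, m = len(truth), len(ocr)
    # cost[i][j] = edit distance between truth[:i] and ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover the edits.
    errors: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            if truth[i - 1] != ocr[j - 1]:
                errors.append(("substitution", truth[i - 1], ocr[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # A ground-truth character is missing from the OCR output.
            errors.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            # The OCR output contains a character absent from the truth.
            errors.append(("insertion", "", ocr[j - 1]))
            j -= 1
    errors.reverse()
    return errors
```

Under these assumptions, each edit returned could then be traced to the sentence boundary, token, or part-of-speech tag it disturbs downstream, which is the kind of cascading effect the abstract refers to.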