Performance evaluation for text processing of noisy inputs
Proceedings of the 2005 ACM symposium on Applied computing
Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages, showing how the impact varies with the nature of the error and the character(s) involved. We also discuss the problem of identifying tabular structures that should not be parsed as sentential text.
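The core idea of casting error classification as an optimization problem can be illustrated with classic edit-distance dynamic programming: align the OCR output against the ground truth, then label each aligned position as correct, a substitution, an insertion, or a deletion. The sketch below is only a minimal character-level illustration of that idea, not the paper's hierarchical algorithm; the function name `align` and the label strings are our own choices for exposition.

```python
def align(truth: str, ocr: str):
    """Return a list of (truth_char, ocr_char, label) tuples describing
    the minimum-edit-distance alignment of `ocr` against `truth`."""
    n, m = len(truth), len(ocr)
    # dp[i][j] = minimum edit distance between truth[:i] and ocr[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitute
                           dp[i - 1][j] + 1,         # delete from truth
                           dp[i][j - 1] + 1)         # insert into OCR output
    # Trace back through the table to recover and label the alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if truth[i - 1] == ocr[j - 1] else 1)):
            label = "correct" if truth[i - 1] == ocr[j - 1] else "substitution"
            ops.append((truth[i - 1], ocr[j - 1], label))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append((truth[i - 1], None, "deletion"))
            i -= 1
        else:
            ops.append((None, ocr[j - 1], "insertion"))
            j -= 1
    return list(reversed(ops))

# Example: the common "rn" -> "m" confusion surfaces as a deletion
# followed by a substitution in the alignment.
print(align("corn", "com"))
```

Once every character position carries a label, cascading effects can be traced by checking which labeled spans fall inside sentence boundaries, token boundaries, or tagged words at later pipeline stages.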