Models of latent document semantics, such as the mixture of multinomials model and Latent Dirichlet Allocation (LDA), have received substantial attention for their ability to discover topical structure in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) output, we endeavor to understand the effect that character-level noise has on unsupervised topic modeling. We demonstrate these effects at the document level (document clustering) and at the word level (LDA), on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common mitigation techniques, such as filtering low-frequency words, do improve model quality, but in the case of LDA they exhibit failure trends similar to those of models trained on unprocessed OCR output. To our knowledge, this study is the first of its kind.
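The low-frequency filtering mentioned above is typically applied as a document-frequency cutoff before model training. The sketch below illustrates the idea on a toy corpus; the corpus, threshold, and function name are illustrative assumptions, not the paper's actual experimental setup:

```python
from collections import Counter

# Toy corpus standing in for noisy OCR output; "rnodel" and "charactor"
# mimic one-off character-level recognition errors (hypothetical data).
docs = [
    "topic model text corpus",
    "topic rnodel text corpus",
    "ocr noise character error",
    "ocr noise charactor error",
]

def filter_low_frequency(docs, min_df=2):
    """Drop words whose document frequency is below min_df."""
    tokenized = [doc.split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each word at most once per document
    vocab = {word for word, count in df.items() if count >= min_df}
    return [[w for w in tokens if w in vocab] for tokens in tokenized]

filtered = filter_low_frequency(docs)
```

With `min_df=2`, one-off OCR errors such as "rnodel" and "charactor" are discarded, but so are the equally rare correct spellings "model" and "character", which helps explain why filtering improves model quality without eliminating the underlying failure trend.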