Models of latent document semantics, such as the mixture of multinomials model and Latent Dirichlet Allocation (LDA), have received substantial attention for their ability to discover topical structure in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) output, we endeavor to understand the effect that character-level noise has on unsupervised topic modeling. We demonstrate these effects at the document level (document clustering) and at the word level (LDA), on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common mitigation techniques, such as filtering low-frequency words, do improve model quality, but in the case of LDA they exhibit failure trends similar to those of models trained on unprocessed OCR output. To our knowledge, this study is the first of its kind.
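The low-frequency filtering mentioned above is typically applied as a document-frequency cutoff before model training. The sketch below illustrates the idea on a toy corpus; the corpus, threshold, and function name are illustrative assumptions, not the paper's actual experimental setup:

```python
from collections import Counter

# Toy corpus standing in for noisy OCR output; "rnodel" and "charactor"
# mimic one-off character-level recognition errors (hypothetical data).
docs = [
    "topic model text corpus",
    "topic rnodel text corpus",
    "ocr noise character error",
    "ocr noise charactor error",
]

def filter_low_frequency(docs, min_df=2):
    """Drop words whose document frequency is below min_df."""
    tokenized = [doc.split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each word at most once per document
    vocab = {word for word, count in df.items() if count >= min_df}
    return [[w for w in tokens if w in vocab] for tokens in tokenized]

filtered = filter_low_frequency(docs)
```

With `min_df=2`, one-off OCR errors such as "rnodel" and "charactor" are discarded, but so are the equally rare correct spellings "model" and "character", which helps explain why filtering improves model quality without eliminating the underlying failure trend.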