Open source historical OCR: the OCRopodium project
ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
Large-scale digitization projects dealing with text-based historical material face challenges that commercial software does not cater for well. This article discusses the results of a project to build a scalable OCR workflow for historical collections, based on open source tools and tailored particularly towards use in small-scale historical archives. It argues that open source tools allow for better customization to match these requirements, particularly with regard to character model training and per-project language modelling. We offer insights from our accuracy evaluation of various open source OCR tools, together with a case study on the challenges and opportunities of open source OCR in historical archives.