Ocropodium: open source OCR for small-scale historical archives

Authors:
Tobias Blanke; Michael Bryant; Mark Hedges
Affiliations:
King's College London, UK; King's College London, UK; King's College London, UK
Venue:
Journal of Information Science
Year:
2012

Abstract

Large-scale digitization projects dealing with text-based historical material face challenges that commercial software does not cater for well. This article discusses the results of a project to build a scalable OCR workflow for historical collections, based on open source tools and tailored in particular to small-scale historical archives. It argues that open source tools allow better customization to match these requirements, particularly with regard to character model training and per-project language modelling. We report our accuracy evaluation results for various open source OCR tools and present a case study on the challenges and opportunities of open source OCR in historical archives.