Applying the OCRopus OCR System to Scholarly Sanskrit Literature

Authors:
Thomas M. Breuel
Affiliations:
DFKI and University of Kaiserslautern, Kaiserslautern, Germany
Venue:
Sanskrit Computational Linguistics
Year:
2009

Abstract

OCRopus is an open-source OCR system currently under development, intended to be omni-lingual and omni-script. Beyond modern digital library applications, the system can be applied to capturing and recognizing classical literature, as well as the large body of research literature about the classics. OCRopus advances the state of the art in several ways: new text recognition and layout analysis modules can be plugged in easily, character recognition is adaptive and user-extensible, and layout analysis is statistical and trainable. Of particular interest for computational linguistics applications is the consistent use of probability estimates throughout the system, together with the use of weighted finite state transducers to represent both alternative recognition hypotheses and statistical language models. In this paper, I first give an overview of these technologies and their relevance to digital library applications in the humanities, and then focus on statistical language models and their role in integrating OCR output with subsequent computational linguistic and information extraction modules.
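
To illustrate how probability estimates from a recognizer can be combined with a statistical language model, here is a minimal sketch in plain Python. It is not the OCRopus API and uses no transducer library. The function name `best_path`, the toy lattice, the bigram table, and the `default_cost` penalty are all assumptions made for this example. The lattice is a simple chain of per-position character alternatives with costs given as negative log probabilities. The dynamic programming search plays the role that composing a hypothesis transducer with a language model transducer and extracting the shortest path would play in a weighted finite state transducer framework.

```python
import math


def best_path(lattice, bigram_cost, default_cost=5.0):
    """Find the cheapest string through a character lattice rescored by a bigram model.

    lattice: list of dicts, one per character position, mapping each candidate
             character to its recognition cost (negative log probability).
    bigram_cost: dict mapping (previous_char, char) to a language model cost;
                 "<s>" marks the start of the line.
    default_cost: hypothetical penalty for bigrams missing from the table.
    """
    # best[c] holds (total cost, text) for the cheapest path ending in character c.
    best = {"<s>": (0.0, "")}
    for candidates in lattice:
        new_best = {}
        for char, ocr_cost in candidates.items():
            for prev, (cost, text) in best.items():
                lm_cost = bigram_cost.get((prev, char), default_cost)
                total = cost + ocr_cost + lm_cost
                if char not in new_best or total < new_best[char][0]:
                    new_best[char] = (total, text + char)
        best = new_best
    return min(best.values())


def nlog(p):
    """Convert a probability to a cost (negative log probability)."""
    return -math.log(p)


# Toy recognizer output: on its own, the recognizer slightly prefers "e" over "c".
lattice = [
    {"c": nlog(0.4), "e": nlog(0.6)},
    {"a": nlog(0.5), "o": nlog(0.5)},
    {"t": nlog(0.9), "l": nlog(0.1)},
]

# Toy bigram language model (made-up probabilities for illustration only).
lm = {
    ("<s>", "c"): nlog(0.5), ("<s>", "e"): nlog(0.5),
    ("c", "a"): nlog(0.6), ("e", "a"): nlog(0.2),
    ("c", "o"): nlog(0.2), ("e", "o"): nlog(0.05),
    ("a", "t"): nlog(0.5), ("o", "t"): nlog(0.3),
}

cost, text = best_path(lattice, lm)
print(text)  # cat
```

In this toy case the recognizer alone would rank a string starting with "e" highest, but adding the language model costs makes "cat" the cheapest path overall. Keeping every score as a cost in one consistent probabilistic scale is what makes it possible to add recognizer and language model evidence directly.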