Combined script and page orientation estimation using the Tesseract OCR engine

Authors: Ranjith Unnikrishnan; Ray Smith
Affiliations: Google Inc., Mountain View, CA; Google Inc., Mountain View, CA
Venue: Proceedings of the International Workshop on Multilingual OCR
Year: 2009

References (9)

Determination of the Script and Language Content of Document Images. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Automatic Script Identification From Document Images Using Cluster-Based Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Trainable Script Identification Strategies for Indian Languages. ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition.

Gabor Filter Based Multi-class Classifier for Scanned Document Images. ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2.

Texture for Script Identification. IEEE Transactions on Pattern Analysis and Machine Intelligence.

An Overview of the Tesseract OCR Engine. ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02.

Learning "forgiving" hash functions: algorithms and large scale tests. IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence.

Hybrid Page Layout Analysis via Tab-Stop Detection. ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition.

Bangla/English script identification based on analysis of connected component profiles. DAS'06 Proceedings of the 7th international conference on Document Analysis Systems.

Cited by (2)

Adapting the Tesseract open source OCR engine for multilingual OCR. Proceedings of the International Workshop on Multilingual OCR.

Ocropodium: open source OCR for small-scale historical archives. Journal of Information Science.


Abstract

This paper proposes a simple but effective algorithm to estimate the script and dominant page orientation of the text contained in an image. A candidate set of shape classes for each script is generated using synthetically rendered text and used to train a fast shape classifier. At run time, the classifier is applied independently to connected components in the image for each possible orientation of the component, and the accumulated confidence scores are used to determine the best estimate of page orientation and script. Results demonstrate the effectiveness of the approach on a dataset of 1846 documents containing a diverse set of images in 14 scripts and any of four possible page orientations. A C++ implementation of this work will be made available in a future release of the open-source Tesseract OCR engine [1].
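The run-time voting step described in the abstract can be illustrated with a minimal sketch. This is not the paper's C++ implementation: the `classify` callable, the bitmap representation, and the function names are all assumptions made for illustration, and the trained shape classifier is replaced by whatever function the caller supplies. The sketch only shows the idea of scoring each connected component at all four orientations and accumulating the confidences per (orientation, script) pair.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical representation: one connected component as rows of 0/1 pixels.
Bitmap = List[List[int]]


def rotate90(bitmap: Bitmap) -> Bitmap:
    """Rotate a bitmap 90 degrees clockwise."""
    return [list(row) for row in zip(*bitmap[::-1])]


def estimate_orientation_and_script(
    components: Iterable[Bitmap],
    classify: Callable[[Bitmap], Dict[str, float]],
) -> Tuple[int, str]:
    """Return the (clockwise correction in degrees, script) with the highest total score.

    `classify` is an assumed stand-in for a trained shape classifier: it maps a
    component bitmap to a confidence per script, treating the bitmap as upright.
    """
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for component in components:
        rotated = component
        for quarter_turns in range(4):
            # Score this component as if `quarter_turns` clockwise turns made it upright.
            for script, confidence in classify(rotated).items():
                totals[(quarter_turns * 90, script)] += confidence
            rotated = rotate90(rotated)
    if not totals:
        raise ValueError("no components or no classifier scores to accumulate")
    # The pair with the largest accumulated confidence wins the vote.
    return max(totals, key=totals.get)
```

Here the returned angle is the clockwise rotation that would bring the page upright under the assumed classifier. Summing confidences over many components means that individual misclassifications matter little as long as most components score highest at the true orientation.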