Determination of the Script and Language Content of Document Images
IEEE Transactions on Pattern Analysis and Machine Intelligence
Automatic Script Identification From Document Images Using Cluster-Based Templates
IEEE Transactions on Pattern Analysis and Machine Intelligence
Trainable Script Identification Strategies for Indian Languages
ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
Gabor Filter Based Multi-class Classifier for Scanned Document Images
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Texture for Script Identification
IEEE Transactions on Pattern Analysis and Machine Intelligence
An Overview of the Tesseract OCR Engine
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
Learning "forgiving" hash functions: algorithms and large scale tests
IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Hybrid Page Layout Analysis via Tab-Stop Detection
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Bangla/English script identification based on analysis of connected component profiles
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Adapting the Tesseract open source OCR engine for multilingual OCR
Proceedings of the International Workshop on Multilingual OCR
Ocropodium: open source OCR for small-scale historical archives
Journal of Information Science
This paper proposes a simple but effective algorithm to estimate the script and dominant page orientation of the text contained in an image. A candidate set of shape classes for each script is generated using synthetically rendered text and used to train a fast shape classifier. At run time, the classifier is applied independently to connected components in the image for each possible orientation of the component, and the accumulated confidence scores are used to determine the best estimate of page orientation and script. Results demonstrate the effectiveness of the approach on a dataset of 1846 documents containing a diverse set of images in 14 scripts and any of four possible page orientations. A C++ implementation of this work will be made available in a future release of the open-source Tesseract OCR engine [1].
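The accumulation step described in the abstract can be illustrated with a minimal Python sketch. This is not the Tesseract implementation: the `classify` callable, the `estimate_orientation_and_script` name, and the choice to sum confidences jointly per (orientation, script) pair and take the maximum are all assumptions made for illustration. The abstract only states that per-component confidence scores are accumulated over each possible orientation.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# The four page orientations considered, in degrees.
ORIENTATIONS = (0, 90, 180, 270)

# Hypothetical classifier interface (an assumption, not a Tesseract API):
# given a connected component and a rotation angle, return a list of
# (script_name, confidence) pairs for that component at that rotation.
Classifier = Callable[[object, int], List[Tuple[str, float]]]


def estimate_orientation_and_script(
    components: Iterable[object],
    classify: Classifier,
) -> Optional[Tuple[int, str]]:
    """Return the (orientation, script) pair with the highest summed confidence.

    Each component is classified independently at every orientation, and the
    confidences are accumulated per (orientation, script) pair. Returns None
    if no component produced any score.
    """
    scores: Dict[Tuple[int, str], float] = defaultdict(float)
    for component in components:
        for angle in ORIENTATIONS:
            for script, confidence in classify(component, angle):
                scores[(angle, script)] += confidence
    if not scores:
        return None
    # Pick the hypothesis with the largest accumulated evidence.
    return max(scores, key=lambda key: scores[key])
```

Because every component votes independently, a few misclassified components are outweighed by the accumulated evidence from the rest of the page, which is presumably why a simple per-component classifier can suffice here.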