Combined script and page orientation estimation using the Tesseract OCR engine

  • Authors: Ranjith Unnikrishnan; Ray Smith
  • Affiliations: Google Inc., Mountain View, CA (both authors)
  • Venue: Proceedings of the International Workshop on Multilingual OCR
  • Year: 2009

Abstract

This paper proposes a simple but effective algorithm to estimate the script and dominant page orientation of the text contained in an image. A candidate set of shape classes for each script is generated using synthetically rendered text and used to train a fast shape classifier. At run time, the classifier is applied independently to connected components in the image for each possible orientation of the component, and the accumulated confidence scores are used to determine the best estimate of page orientation and script. Results demonstrate the effectiveness of the approach on a dataset of 1846 documents containing a diverse set of images in 14 scripts and any of four possible page orientations. A C++ implementation of this work will be made available in a future release of the open-source Tesseract OCR engine [1].
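The run-time voting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the classifier interface (a function returning `(script, confidence)` pairs for a bitmap), the choice to sum the top confidence per component for each orientation, and all names (`estimate_orientation_and_script`, `toy_classify`, `UPRIGHT`) are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Component = List[List[int]]  # binary bitmap of one connected component
# Hypothetical classifier interface: bitmap -> [(script, confidence), ...]
Classifier = Callable[[Component], Sequence[Tuple[str, float]]]


def rotate90(bitmap: Component) -> Component:
    """Rotate a bitmap 90 degrees clockwise."""
    return [list(row) for row in zip(*bitmap[::-1])]


def estimate_orientation_and_script(
    components: Sequence[Component], classify: Classifier
) -> Tuple[int, Optional[str]]:
    """Return (rotation in degrees that makes the text upright, script).

    Assumption: each component votes for an orientation with its top
    classifier confidence, and for scripts with every returned confidence.
    """
    orientation_scores: Dict[int, float] = {0: 0.0, 90: 0.0, 180: 0.0, 270: 0.0}
    script_scores: Dict[int, Dict[str, float]] = {d: {} for d in orientation_scores}

    for bitmap in components:
        rotated = bitmap
        for degrees in (0, 90, 180, 270):
            results = classify(rotated)
            if results:
                # Accumulate the best confidence for this orientation.
                orientation_scores[degrees] += max(conf for _, conf in results)
            for script, conf in results:
                per_script = script_scores[degrees]
                per_script[script] = per_script.get(script, 0.0) + conf
            rotated = rotate90(rotated)  # try the next orientation

    best_orientation = max(orientation_scores, key=orientation_scores.get)
    candidates = script_scores[best_orientation]
    best_script = max(candidates, key=candidates.get) if candidates else None
    return best_orientation, best_script


# Toy stand-in for a trained shape classifier (purely illustrative):
# it is confident only when it sees one specific upright "L" shape.
UPRIGHT: Component = [[1, 0], [1, 0], [1, 1]]


def toy_classify(bitmap: Component) -> Sequence[Tuple[str, float]]:
    if bitmap == UPRIGHT:
        return [("Latin", 0.9)]
    return [("Latin", 0.1), ("Cyrillic", 0.1)]


# Simulate an upside-down page: every component is rotated by 180 degrees.
upside_down = rotate90(rotate90(UPRIGHT))
result = estimate_orientation_and_script([upside_down] * 5, toy_classify)
```

In this toy setup `result` is the pair of the corrective rotation and the winning script. A real system would replace `toy_classify` with the trained shape classifier and would likely normalize or threshold the accumulated scores before committing to a decision.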