Script and Language Identification from Document Images

  • Authors:
  • G. S. Peake; T. N. Tan

  • Venue:
  • DIA '97 Proceedings of the 1997 Workshop on Document Image Analysis
  • Year:
  • 1997

Abstract

This paper presents a detailed review of current script and language identification techniques. The main criticism of existing techniques is that most of them rely on character segmentation. We then present a new texture-analysis method for script identification that does not require character segmentation. Simple preprocessing of the document image produces a uniform text block on which texture analysis can be performed. Texture features are extracted with multi-channel (Gabor) filters and with grey level co-occurrence matrices, in independent experiments. Test documents are classified from the features of training documents using a k-NN classifier. Initial results are very promising: over 95% accuracy in classifying 105 test documents from 7 languages. The method is robust to noise and to the presence of foreign characters or numerals, and it can be applied to very small amounts of text.
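
The co-occurrence-matrix branch of the pipeline described in the abstract (texture features from a uniform block, then k-NN classification) can be illustrated with a minimal sketch. This is not the authors' implementation. The pixel offsets, the three features (energy, contrast, homogeneity), the value of k, and the synthetic stripe textures that stand in for text blocks are all assumptions chosen for illustration, and the Gabor filter branch is not covered.

```python
import math
import random
from collections import Counter


def glcm(image, dx, dy, levels):
    """Normalised, symmetric grey level co-occurrence matrix for offset (dx, dy)."""
    h, w = len(image), len(image[0])
    m = [[0.0] * levels for _ in range(levels)]
    total = 0.0
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < h and 0 <= c2 < w:
                a, b = image[r][c], image[r2][c2]
                m[a][b] += 1.0
                m[b][a] += 1.0  # count each pixel pair in both directions
                total += 2.0
    return [[v / total for v in row] for row in m]


def texture_features(image, levels=2, offsets=((1, 0), (0, 1), (1, 1), (1, -1))):
    """Energy, contrast and homogeneity for each offset (assumed feature set)."""
    feats = []
    for dx, dy in offsets:
        p = glcm(image, dx, dy, levels)
        cells = [(i, j, p[i][j]) for i in range(levels) for j in range(levels)]
        feats.append(sum(v * v for _, _, v in cells))                 # energy
        feats.append(sum(v * (i - j) ** 2 for i, j, v in cells))      # contrast
        feats.append(sum(v / (1 + abs(i - j)) for i, j, v in cells))  # homogeneity
    return feats


def knn_classify(train, sample, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stripes(size, horizontal, rng, noise=0.05):
    """Hypothetical stand-in for a text block: noisy binary stripes of width 2."""
    img = []
    for r in range(size):
        row = []
        for c in range(size):
            v = ((r if horizontal else c) // 2) % 2
            if rng.random() < noise:
                v = 1 - v  # flip the pixel to simulate noise
            row.append(v)
        img.append(row)
    return img


rng = random.Random(0)
train = [(texture_features(stripes(32, h, rng)), name)
         for h, name in ((True, "horizontal"), (False, "vertical"))
         for _ in range(5)]
query = texture_features(stripes(32, True, rng))
predicted = knn_classify(train, query)
print(predicted)
```

In a real setting the stripe generator would be replaced by normalised text blocks cut from scanned documents, and the labels by script or language names. The feature set and k would need tuning on actual data.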