Script and Language Identification in Noisy and Degraded Document Images

Authors:
Lu Shijian;Chew Lim Tan
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2008

Abstract

This paper reports an identification technique that detects the scripts and languages of noisy and degraded document images. In the proposed technique, scripts and languages are identified through document vectorization, which converts each document image into a document vector that characterizes the shape and frequency of the contained character or word images. Document images are vectorized by using vertical component cuts and character extremum points, both of which are tolerant to variation in text fonts and styles, noise, and various types of document degradation. For each script or language under study, a script or language template is first constructed through a training process. The scripts and languages of document images are then determined according to the distances between the converted document vectors and the pre-constructed script and language templates. Experimental results show that the proposed technique is accurate, easy to extend, and tolerant to noise and various types of document degradation.
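
The template-matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the vector layout, the training procedure, or the distance measure, so the sketch assumes that each document has already been reduced to a list of shape codes, that a template is the mean of the unit-normalized training vectors, and that Euclidean distance is used. All names (`normalize`, `build_template`, `identify`, the toy codes and labels) are hypothetical.

```python
import math
from collections import Counter


def normalize(counts, vocabulary):
    """Map shape-code counts to a unit-length vector over a fixed vocabulary."""
    vec = [float(counts.get(code, 0)) for code in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def build_template(training_docs, vocabulary):
    """Assumed training step: average the normalized vectors of one class."""
    vectors = [normalize(Counter(doc), vocabulary) for doc in training_docs]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(a, b):
    """Assumed distance measure between a document vector and a template."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(doc_codes, templates, vocabulary):
    """Return the label of the template closest to the document vector."""
    vec = normalize(Counter(doc_codes), vocabulary)
    return min(templates, key=lambda label: euclidean(vec, templates[label]))


# Toy data: shape codes "a", "b", "c" stand in for real image-derived features.
VOCABULARY = ["a", "b", "c"]
TRAINING = {
    "script_A": [["a", "a", "b"], ["a", "b", "a", "a"]],
    "script_B": [["c", "c", "b"], ["c", "b", "c", "c"]],
}
TEMPLATES = {
    label: build_template(docs, VOCABULARY) for label, docs in TRAINING.items()
}

if __name__ == "__main__":
    print(identify(["a", "a", "a", "b"], TEMPLATES, VOCABULARY))
```

Under these assumptions, supporting a new script or language only requires building one more template from its training documents, which is consistent with the abstract's claim that the technique is easy to extend.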