Script and language identification in degraded and distorted document images

Authors:
Shijian Lu; Chew Lim Tan
Affiliations:
Department of Computer Science, National University of Singapore; Department of Computer Science, National University of Singapore
Venue:
AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1
Year:
2006

Abstract

This paper reports a statistical technique that identifies scripts and languages in degraded and distorted document images. Identification is performed through document vectorization, which transforms each document image into an electronic document vector that characterizes the shape and frequency of the character and word images it contains. Scripts are identified first, based on the density and distribution of vertical runs between character strokes and a vertical scan line. Latin-based languages are then differentiated using a set of word shape codes constructed from horizontal word runs and character extremum points. Experimental results show that the method tolerates noise, document degradation, and slight document skew, and attains an average identification rate of over 95%.
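To make the idea of vertical-run statistics concrete, the sketch below counts, for each vertical scan line (pixel column) of a binarized image, how many separate stroke runs the line crosses, and then summarizes those counts as a normalized histogram. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define the run feature precisely, so the function names `vertical_run_counts` and `run_feature_vector`, the `max_runs` cap, and the representation of the image as a list of 0/1 rows are all hypothetical.

```python
from collections import Counter


def vertical_run_counts(image):
    """Count foreground runs crossed by each vertical scan line.

    `image` is assumed to be a list of rows, each a list of 0/1 pixels
    (1 = stroke). This representation is an assumption for illustration.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    counts = []
    for x in range(width):
        runs = 0
        previous = 0
        for y in range(height):
            pixel = image[y][x]
            # A new run starts where background changes to foreground.
            if pixel and not previous:
                runs += 1
            previous = pixel
        counts.append(runs)
    return counts


def run_feature_vector(image, max_runs=4):
    """Summarize run counts as a normalized histogram (hypothetical feature).

    Columns that cross no stroke are ignored; counts above `max_runs`
    are folded into the last bin.
    """
    counts = [c for c in vertical_run_counts(image) if c > 0]
    if not counts:
        return [0.0] * max_runs
    histogram = Counter(min(c, max_runs) for c in counts)
    return [histogram[k] / len(counts) for k in range(1, max_runs + 1)]


# Tiny hand-made example image (not real document data).
sample = [
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]
sample_vector = run_feature_vector(sample)
```

A histogram of this kind is one simple way to capture both how many runs occur and how they are distributed. Scripts with dense, multi-stroke characters would be expected to put more weight in the higher bins than Latin text, although the thresholds or templates needed to separate them would have to come from training data.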