Language identification in degraded and distorted document images

Authors:
Shijian Lu;Chew Lim Tan;Weihua Huang
Affiliations:
School of Computing, National University of Singapore, Singapore;School of Computing, National University of Singapore, Singapore;School of Computing, National University of Singapore, Singapore
Venue:
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Year:
2006

Citing 5
Cited 1

Determination of the Script and Language Content of Document Images

IEEE Transactions on Pattern Analysis and Machine Intelligence
Automatic Script Identification From Document Images Using Cluster-Based Templates

IEEE Transactions on Pattern Analysis and Machine Intelligence
Rotation Invariant Texture Features and Their Use in Automatic Script Identification

IEEE Transactions on Pattern Analysis and Machine Intelligence
Language identification of on-line documents using word shapes

ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Perspective rectification of document images using fuzzy set and morphological operations

Image and Vision Computing

Script and Language Identification in Noisy and Degraded Document Images

IEEE Transactions on Pattern Analysis and Machine Intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents a language identification technique that differentiates Latin-based languages in degraded and distorted document images. Different from the reported methods that transform word images through a character shape coding process, our method directly captures word shapes with the local extremum points and the horizontal intersection numbers, which are both tolerant of noise, character segmentation errors, and slight skew distortions. For each language studied, a word shape template and a word frequency template are firstly constructed based on the proposed word shape coding scheme. Identification is then accomplished based on Bray Curtis or Hamming distance between the word shape code of query images and the constructed word shape and frequency templates. Experiments show the average identification rate upon eight Latin-based languages reaches over 99%. ...