Word level multi-script identification

Authors:
Peeta Basa Pati; A. G. Ramakrishnan
Affiliations:
MILE Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bangalore 560 012, Karnataka, India (both authors)
Venue:
Pattern Recognition Letters
Year:
2008

Abstract

We report an algorithm to identify the script of each word in a document image. We start with a bi-script scenario, then extend it to tri-script and eleven-script scenarios. For each script, a database of 20,000 words in different font styles and sizes was collected and used. The effectiveness of Gabor and discrete cosine transform (DCT) features was evaluated independently using nearest neighbor, linear discriminant, and support vector machine (SVM) classifiers. Gabor features combined with a nearest neighbor or SVM classifier give promising results: over 98% for the bi-script and tri-script cases and above 89% for the eleven-script scenario.
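
The pairing of Gabor features with a nearest neighbor classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the filter bank, so the wavelengths, orientation count, kernel size, and energy-based feature below are assumptions, and the striped images are hypothetical stand-ins for word images of two scripts. It assumes each image is at least as large as the 15 x 15 kernel.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real (even-symmetric) Gabor kernel: cosine carrier under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * x_rot / wavelength)
    return kernel - kernel.mean()  # remove DC so flat regions give no response

def gabor_features(image, wavelengths=(4.0, 8.0), n_orientations=4, size=15):
    """Mean response energy per filter (assumed feature), via FFT circular convolution."""
    image = np.asarray(image, dtype=float)
    spectrum = np.fft.fft2(image)
    feats = []
    for wavelength in wavelengths:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            kernel = gabor_kernel(size, wavelength, theta, sigma=0.5 * wavelength)
            kernel_spectrum = np.fft.fft2(kernel, s=image.shape)  # zero-pad to image size
            response = np.real(np.fft.ifft2(spectrum * kernel_spectrum))
            feats.append(np.mean(response ** 2))
    feats = np.array(feats)
    return feats / (np.linalg.norm(feats) + 1e-12)  # unit length for scale invariance

def nearest_neighbor(train_feats, train_labels, query):
    """Return the label of the training vector closest to the query (Euclidean)."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    return train_labels[int(np.argmin(dists))]

def striped_word(orientation, period, shape=(32, 64), seed=0):
    """Hypothetical stand-in for a word image: noisy binary stripes."""
    rng = np.random.default_rng(seed)
    rows, cols = np.indices(shape)
    coord = rows if orientation == "horizontal" else cols
    stripes = (np.sin(2.0 * np.pi * coord / period) > 0).astype(float)
    return stripes + 0.05 * rng.standard_normal(shape)

# Two made-up "scripts" that differ in dominant stroke direction.
train_images = [striped_word("horizontal", 4, seed=1), striped_word("horizontal", 8, seed=2),
                striped_word("vertical", 4, seed=3), striped_word("vertical", 8, seed=4)]
train_labels = ["script_A", "script_A", "script_B", "script_B"]
train_feats = np.stack([gabor_features(img) for img in train_images])

query = striped_word("horizontal", 6, seed=5)
predicted = nearest_neighbor(train_feats, train_labels, gabor_features(query))
print(predicted)
```

Swapping `nearest_neighbor` for a linear discriminant or SVM classifier, or `gabor_features` for DCT coefficients, would give the other feature and classifier combinations the abstract compares.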