Determination of the Script and Language Content of Document Images

Authors:
A. Lawrence Spitz
Affiliations:
-
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
1997

Citing 0
Cited 33

Rotation Invariant Texture Features and Their Use in Automatic Script Identification

IEEE Transactions on Pattern Analysis and Machine Intelligence
Twenty Years of Document Image Analysis in PAMI

IEEE Transactions on Pattern Analysis and Machine Intelligence
Imaged Document Text Retrieval Without OCR

IEEE Transactions on Pattern Analysis and Machine Intelligence
A Recognition System for Devnagri and English Handwritten Numerals

ICMI '00 Proceedings of the Third International Conference on Advances in Multimodal Interfaces
Script Identification in Printed Bilingual Documents

DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
Gabor Filter Based Multi-class Classifier for Scanned Document Images

ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Multi-Script Line identification from Indian Documents

ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Online Handwritten Script Recognition

IEEE Transactions on Pattern Analysis and Machine Intelligence
Texture for Script Identification

IEEE Transactions on Pattern Analysis and Machine Intelligence
Identifying Script on Word-Level with Informational Confidence

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Language Identification of Character Images Using Machine Learning Techniques

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Script Identification Using Steerable Gabor Filters

ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Automatic document orientation detection and categorization through document vectorization

MULTIMEDIA '06 Proceedings of the 14th annual ACM international conference on Multimedia
Script and Language Identification in Noisy and Degraded Document Images

IEEE Transactions on Pattern Analysis and Machine Intelligence
Word level multi-script identification

Pattern Recognition Letters
Word-Wise Thai and Roman Script Identification

ACM Transactions on Asian Language Information Processing (TALIP)
Curvature feature distribution based classification of Indian scripts from document images

Proceedings of the International Workshop on Multilingual OCR
Combined script and page orientation estimation using the Tesseract OCR engine

Proceedings of the International Workshop on Multilingual OCR
Orientation detection of major Indian scripts

Proceedings of the International Workshop on Multilingual OCR
Language identification for handwritten document images using a shape codebook

Pattern Recognition
Script and language identification in degraded and distorted document images

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Local features-based script recognition from printed bilingual document images

International Journal of Computer Applications in Technology
Word level identification of Kannada, Hindi and English scripts from a tri-lingual document

International Journal of Computational Vision and Robotics
Document image analysis: issues, comparison of methods and remaining problems

Artificial Intelligence Review
A survey of keyword spotting techniques for printed document images

Artificial Intelligence Review
Script based text identification: a multi-level architecture

Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Multi-font script identification using texture-based features

ICIAR'06 Proceedings of the Third international conference on Image Analysis and Recognition - Volume Part II
Language identification in degraded and distorted document images

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Bangla/English script identification based on analysis of connected component profiles

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Script identification from Indian documents

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Exploratory analysis system for semi-structured engineering logs

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
HVS inspired system for script identification in Indian multi-script documents

DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
Performance analysis of feature extractors and classifiers for script recognition of English and Gurmukhi words

Proceeding of the workshop on Document Analysis and Recognition

Quantified Score

Hi-index	0.14

Abstract

Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly available and adds utility to such systems.

Languages and their scripts have attributes that make it possible to determine the language of a document automatically. Detection of the values of these attributes requires the recognition of particular features of the document image and, in the case of languages using Latin-based symbols, the character syntax of the underlying language.

We have developed techniques for distinguishing which language is represented in an image of text. This work is restricted to a small but important subset of the world's languages. The method first classifies the script into two broad classes: Han-based and Latin-based. This classification is based on the spatial relationships of features related to the upward concavities in character structures. Language identification within the Han script class (Chinese, Japanese, Korean) is performed by analysis of the distribution of optical density in the text images. We handle 23 Latin-based languages using a technique based on character shape codes, a representation of Latin text that is inexpensive to compute.
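
To make the character shape code idea more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small, hypothetical set of shape classes (ascender, x-height, descender, dotted letters) and derives the codes from already-known characters, whereas the method described above derives them from the document image without OCR. The names `shape_code`, `encode`, `token_profile`, and `identify`, the class mapping, and the tiny sample texts are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical shape classes, chosen for illustration only.
# They are NOT claimed to match the code set used in the paper.
ASCENDERS = set("bdfhklt")   # lowercase letters that rise above x-height
DESCENDERS = set("gpqy")     # lowercase letters that fall below the baseline


def shape_code(ch: str) -> str:
    """Map one character to a coarse shape class."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"           # tall shape
    if ch in ("i", "j"):
        return ch            # dotted letters keep their own class
    if ch in DESCENDERS:
        return "g"           # descending shape
    if ch.isalpha():
        return "x"           # plain x-height shape
    return ch                # punctuation and other symbols pass through


def encode(text: str) -> str:
    """Replace every character of a string by its shape class."""
    return "".join(shape_code(c) for c in text)


def token_profile(text: str) -> Counter:
    """Count word shape tokens in whitespace-separated text."""
    return Counter(encode(word) for word in text.split())


def identify(text: str, profiles: dict) -> str:
    """Return the language whose profile best matches the query tokens."""
    query = token_profile(text)

    def score(profile: Counter) -> float:
        total = sum(profile.values())
        # Weight each query token by its relative frequency in the profile.
        return sum(n * profile[tok] / total for tok, n in query.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Toy training samples (assumed, far too small for real use).
PROFILES = {
    "English": token_profile("the cat and the dog sat on the mat with the hat"),
    "German": token_profile("der Hund und die Katze sind in dem Haus mit der Maus"),
}

if __name__ == "__main__":
    print(encode("the"))                                 # AAx
    print(identify("the hat and the dog", PROFILES))     # English
    print(identify("der Hund und die Maus", PROFILES))   # German
```

The sketch relies on the observation that frequent short words tend to have characteristic shape tokens in each language (for example, "the" becomes `AAx` under this mapping), so token frequencies alone can separate languages without recognizing individual characters. A realistic profile would need far more training text per language than the toy samples used here.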