Language Identification of Character Images Using Machine Learning Techniques

Authors:
Ying-Ho Liu; Fu Chang; Chin-Chin Lin
Affiliations:
Institute of Information Science, Academia Sinica, Taipei, Taiwan; Institute of Information Science, Academia Sinica, Taipei, Taiwan; National Taipei University of Technology, Taipei, Taiwan
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Abstract

In this paper, we propose a new approach to identifying the language of character images. We classify individual character images in order to locate the language boundaries in multilingual documents. Two effective methods are considered for this purpose: the prototype classification method and support vector machines (SVMs). Because our training dataset is large, we also propose a technique that speeds up the training process for both methods. Applied to classifying characters as Chinese, English, or Japanese (including Hiragana and Katakana), the two methods produced highly accurate and comparable test results.
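
The idea of labelling each character and then reading off the language boundaries can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-dimensional feature vectors are toy values, the helper names (`build_prototypes`, `classify`, `language_boundaries`) are hypothetical, and a simple nearest-mean rule stands in for the prototype classification method. It does not reproduce the paper's prototype construction, its SVM setup, or its training speed-up technique.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def build_prototypes(samples):
    """Average the feature vectors of each label into one prototype per label.

    Assumption: one mean vector per language; the paper's own prototype
    construction is not described in the abstract.
    """
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(features, prototypes):
    """Return the label whose prototype is nearest to the feature vector."""
    return min(prototypes, key=lambda label: dist(features, prototypes[label]))


def language_boundaries(labels):
    """Return the indices where the predicted language differs from the previous character."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Toy training data: (feature vector, language) pairs with made-up values.
training = [
    ([0.9, 0.8], "Chinese"), ([0.8, 0.9], "Chinese"),
    ([0.2, 0.1], "English"), ([0.1, 0.2], "English"),
    ([0.5, 0.1], "Japanese"), ([0.6, 0.2], "Japanese"),
]
prototypes = build_prototypes(training)

# Feature vectors for the characters of one hypothetical text line, in reading order.
line = [[0.85, 0.85], [0.9, 0.9], [0.15, 0.15], [0.1, 0.1], [0.55, 0.15]]
labels = [classify(features, prototypes) for features in line]
boundaries = language_boundaries(labels)
```

Here `labels` holds one predicted language per character, and `boundaries` lists the positions at which a new language run begins. In a real system the feature vectors would be extracted from the character images, and either classifier named in the abstract could supply the per-character labels.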