Arabic script web page language identifications using decision tree neural networks

Authors:
A. Selamat;C. C. Ng
Affiliations:
Intelligent Software Engineering Laboratory, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor, Malaysia;Intelligent Software Engineering Laboratory, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor, Malaysia
Venue:
Pattern Recognition
Year:
2011

Abstract

In this paper, we propose a hybrid approach to language identification for Arabic-script web pages that combines a decision tree with an ARTMAP neural network. We first use the decision tree to determine the general identity of a web document, that is, whether or not it is written in Arabic script. We then feed selected representations of the pages identified by the decision tree into the ARTMAP neural network, which further verifies the languages detected by the algorithm. In our initial experiments, we found that although the decision tree may achieve higher accuracy than ARTMAP, it may be less reliable than ARTMAP when the task is extended to Arabic-script web documents in different languages (e.g., Urdu, Arabic, and Persian). We therefore propose the hybrid decision tree-ARTMAP approach to improve the performance of Arabic-script language identification on web documents in a variety of languages. The results show that the proposed approach outperforms both the decision tree and the default ARTMAP approaches.
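
The two-stage flow described in the abstract (a coarse Arabic-script check, followed by a finer decision among Arabic-script languages) can be illustrated with a minimal sketch. This is not the authors' method: both stages below are simple rule-based stand-ins, not a trained decision tree or an ARTMAP network. The 0.5 threshold, the marker-letter sets, and all function names are assumptions made only for illustration.

```python
# Illustrative two-stage cascade. Stage 1 stands in for the decision tree,
# stage 2 stands in for ARTMAP; neither is the paper's actual model.

ARABIC_BLOCK = (0x0600, 0x06FF)  # Unicode "Arabic" block

# Hypothetical marker letters, checked in priority order. Urdu comes first
# because Urdu text also uses the Persian-style letters listed below.
MARKERS = [
    ("Urdu", set("\u0679\u0688\u0691\u06ba\u06d2")),   # tte, ddal, rre, noon ghunna, yeh barree
    ("Persian", set("\u067e\u0686\u0698\u06af")),      # peh, tcheh, jeh, gaf
]


def arabic_ratio(text):
    """Fraction of alphabetic characters that fall in the Arabic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    lo, hi = ARABIC_BLOCK
    return sum(lo <= ord(c) <= hi for c in letters) / len(letters)


def is_arabic_script(text, threshold=0.5):
    """Stage 1 (assumed rule): coarse Arabic-script vs. other-script check."""
    return arabic_ratio(text) >= threshold


def identify_language(text):
    """Stage 2 (assumed rule): pick a language among Arabic-script pages."""
    if not is_arabic_script(text):
        return "non-Arabic-script"
    for language, marker_letters in MARKERS:
        if any(c in marker_letters for c in text):
            return language
    return "Arabic"  # default when no marker letter is present


if __name__ == "__main__":
    for sample in ["hello world", "\u0643\u062a\u0627\u0628", "\u067e\u062f\u0631"]:
        print(repr(sample), identify_language(sample))
```

The cascade structure mirrors the design choice in the abstract: only pages that pass the coarse script check are handed to the second stage, so the second stage can focus on separating languages that share the script.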