Improve feature selection method of web page language identification using fuzzy ARTMAP

Authors:
Choon-Ching Ng;Ali Selamat
Affiliations:
Faculty of Computer Science and Information Systems, University of Technology Malaysia (UTM), 81310 Skudai, Johor Bahru, Johor, Malaysia;Faculty of Computer Science and Information Systems, University of Technology Malaysia (UTM), 81310 Skudai, Johor Bahru, Johor, Malaysia
Venue:
International Journal of Intelligent Information and Database Systems
Year:
2010

Abstract

The amount of information available in languages other than English on the World Wide Web and in global information systems is increasing significantly. Several languages can share a single script; for example, Arabic, Persian, Urdu and Pashto are all written in Arabic script letters. The issue is how to produce reliable features from a web page that is to undergo language identification. Incorrectly identifying the language results in garbled translations as well as faulty and incomplete analyses. The aim of this study is to enhance the effectiveness of the feature selection method used for web page language identification. We investigated four feature selection methods: total N-grams, N-grams frequency, N-grams frequency document frequency, and N-grams frequency inverse document frequency. The experimental results show that N-grams frequency gives the most promising results of the four.
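
The abstract names N-gram statistics without defining them, so the following is only a minimal sketch of the usual textbook quantities: character N-gram frequency, document frequency, and frequency weighted by inverse document frequency. The function names, the choice of character-level N-grams, and the unsmoothed log(N/df) weighting are assumptions made for illustration; the paper's exact formulas may differ.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequency(docs, n):
    """Count how often each n-gram occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(doc, n))
    return counts


def document_frequency(docs, n):
    """Count in how many documents each n-gram occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc, n)))
    return df


def frequency_idf(docs, n):
    """Weight each n-gram's frequency by log(N / df).

    Assumed textbook form of inverse document frequency; N is the
    number of documents and no smoothing is applied.
    """
    tf = ngram_frequency(docs, n)
    df = document_frequency(docs, n)
    total = len(docs)
    return {gram: tf[gram] * math.log(total / df[gram]) for gram in tf}
```

Under these assumptions, an N-gram that appears in every document receives a weight of zero from the inverse document frequency term, while raw N-gram frequency keeps it as a feature.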