Improved N-grams approach for web page language identification
Transactions on computational collective intelligence V
The volume of information available in languages other than English on the World Wide Web and in global information systems is increasing significantly. Several languages can share a single script; for example, Arabic, Persian, Urdu and Pashto are all written with Arabic script letters. The issue is how to produce reliable features from a web page that is to undergo language identification. Incorrectly identifying the language results in garbled translations as well as faulty and incomplete analyses. The aim of this study is to enhance the effectiveness of the feature selection method for web page language identification. We investigated four feature selection methods: total N-grams, N-gram frequency, N-gram frequency document frequency, and N-gram frequency inverse document frequency. The experimental results show that N-gram frequency gives the most promising results compared with the other feature selection methods.
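
To make the idea of N-gram frequency features concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not define the four feature selection methods precisely, so this assumes that "N-gram frequency" means the relative frequency of each overlapping character N-gram in a text, and it swaps in cosine similarity against per-language profiles as a simple stand-in classifier. The function names (`char_ngrams`, `ngram_frequency`, `identify_language`) and the toy training sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequency(text, n=3):
    """Map each n-gram to its relative frequency (count / total n-grams).

    Assumption: this is one plausible reading of "N-gram frequency";
    the source abstract does not give the exact formula.
    """
    grams = char_ngrams(text, n)
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify_language(text, profiles, n=3):
    """Return the language whose profile is most similar to `text`."""
    features = ngram_frequency(text, n)
    return max(profiles, key=lambda lang: cosine(features, profiles[lang]))


# Toy profiles built from one sentence per language (illustrative only;
# a real system would train on many web pages per language).
profiles = {
    "en": ngram_frequency("the quick brown fox jumps over the lazy dog and the cat"),
    "es": ngram_frequency("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

if __name__ == "__main__":
    print(identify_language("the dog and the fox", profiles))
    print(identify_language("el perro y el gato", profiles))
```

Because the features are character N-grams rather than whole words, the same sketch applies unchanged to text in Arabic script, where languages such as Arabic, Persian, Urdu and Pashto share letters but differ in how often particular letter sequences occur.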