This article introduces an N-gram-based method for detecting the language, script, and encoding scheme of a text. Given a computer-encoded target document, the method counts how many of the target's byte sequences match the byte sequences that can appear in texts belonging to a given language, script, and encoding scheme. It differs from conventional N-gram-based methods in that the threshold for every category is fixed in advance. The method was originally created for a survey of how many web pages are written in a particular language, script, and encoding scheme. That survey requires the method to return either the correct answer or "unable to detect", where "unable to detect" also covers text in a category other than those registered. The method has some minor problems, but experiments have confirmed its effectiveness for detecting language, script, and encoding scheme.
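The matching idea described above can be illustrated with a minimal Python sketch. It is not the article's implementation: the trigram size, the 0.9 threshold, the set-based profiles, the category labels, and the function names `byte_ngrams`, `build_profiles`, and `detect` are all assumptions chosen for illustration. The sketch computes, for each registered category, the fraction of the target's byte N-grams that also occur in that category's training bytes, and answers "unable to detect" when no category reaches the fixed threshold.

```python
from typing import Dict, List, Set

UNABLE_TO_DETECT = "unable to detect"


def byte_ngrams(data: bytes, n: int = 3) -> List[bytes]:
    """Return every contiguous n-byte sequence in data (assumed n = 3)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_profiles(samples: Dict[str, bytes], n: int = 3) -> Dict[str, Set[bytes]]:
    """Map each category label to the set of byte n-grams seen in its sample."""
    return {label: set(byte_ngrams(text, n)) for label, text in samples.items()}


def detect(target: bytes, profiles: Dict[str, Set[bytes]],
           n: int = 3, threshold: float = 0.9) -> str:
    """Return the best-matching category, or UNABLE_TO_DETECT.

    The match rate is the share of the target's n-grams found in a
    category's profile. The threshold is one fixed value for all
    categories (0.9 here is an illustrative assumption).
    """
    grams = byte_ngrams(target, n)
    if not grams:
        return UNABLE_TO_DETECT
    best_label, best_rate = UNABLE_TO_DETECT, 0.0
    for label, known in profiles.items():
        rate = sum(g in known for g in grams) / len(grams)
        if rate > best_rate:
            best_label, best_rate = label, rate
    return best_label if best_rate >= threshold else UNABLE_TO_DETECT


# Hypothetical categories, each keyed as language/script/encoding.
profiles = build_profiles({
    "english/latin/utf-8": b"the quick brown fox jumps over the lazy dog",
    "japanese/kana-kanji/utf-8": "こんにちは世界".encode("utf-8"),
})

print(detect(b"the lazy dog", profiles))
print(detect("Привет мир".encode("utf-8"), profiles))
```

With these toy profiles, the first call matches the English category because every trigram of the target occurs in the training sample, while the Russian text matches no registered category and falls through to "unable to detect". A realistic profile would be built from a much larger corpus per category.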