Scientific and Engineering Problem-Solving with the Computer
Scientific and Engineering Problem-Solving with the Computer
Data Mining: Introductory and Advanced Topics
Data Mining: Introductory and Advanced Topics
Language identification in web pages
Proceedings of the 2005 ACM symposium on Applied computing
Language and task independent text categorization with simple language models
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Identification of Document Language is Not yet a Completely Solved Problem
CIMCA '06 Proceedings of the International Conference on Computational Inteligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce
Hi-index | 0.00 |
Rapid classification of documents is of high-importance in many multilingual settings (such as international institutions or Internet search engines). This has been, for years, a well-known problem, addressed by different techniques, with excellent results. We address this problem by a simple n-grams based technique, a variation of techniques of this family. Our n-grams-based classification is very robust and successful, even for 20-fold classification, and even for short text strings. We give a detailed study for different lengths of strings and size of n-grams and we explore what classification parameters give the best performance. There is no requirement for vocabularies, but only for a few training documents. As a main corpus, we used a EU set of documents in 20 languages. Experimental comparison shows that our approach gives better results than four other popular approaches.