A Variant of N-Gram Based Language Classification

Authors:
Andrija Tomović;Predrag Janičić
Affiliations:
Friedrich Miescher Institute for Biomedical Research, Part of the Novartis Research Foundation, Maulbeerstrasse 66, CH-4058 Basel, Switzerland;Faculty of Mathematics, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia
Venue:
AI*IA '07 Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence on AI*IA 2007: Artificial Intelligence and Human-Oriented Computing
Year:
2007

Abstract

Rapid classification of documents is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been a well-known problem for years, addressed by different techniques with excellent results. We address it with a simple n-gram-based technique, a variation on existing techniques of this family. Our n-gram-based classification is very robust and successful, even for 20-fold classification and even for short text strings. We give a detailed study for different string lengths and n-gram sizes, and we explore which classification parameters give the best performance. The technique requires no vocabularies, only a few training documents. As the main corpus, we used an EU set of documents in 20 languages. Experimental comparison shows that our approach gives better results than four other popular approaches.
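
To illustrate the general family of techniques the abstract refers to, the following is a minimal sketch of character n-gram language classification. It is not the specific variant proposed in the paper, whose details are not given in this record. The function names (ngram_profile, cosine, classify), the choice of trigrams, the use of cosine similarity, and the two toy training sentences are all assumptions made for illustration only.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of overlapping character n-grams."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse n-gram frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(text, profiles, n=3):
    """Pick the language whose profile is most similar to the text's profile."""
    query = ngram_profile(text, n)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (hypothetical; a real system would use a few full
# training documents per language, as the abstract describes).
training = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog runs away",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann läuft der hund weg",
}
profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}

if __name__ == "__main__":
    print(classify("the dog and the fox", profiles))
    print(classify("der hund und der fuchs", profiles))
```

No vocabulary is needed in this sketch: each language is represented only by n-gram statistics gathered from its training text, which matches the requirement stated in the abstract. The n-gram size n and the length of the classified string are the parameters one would vary in a study like the one described.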