Yet another language identifier

Authors:
Martin Majliš
Affiliations:
Charles University in Prague
Venue:
EACL '12 Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2012

Citing 4
Cited 0

Language identification: a solved problem suitable for undergraduate instruction

Journal of Computing Sciences in Colleges
Language identification in web pages

Proceedings of the 2005 ACM symposium on Applied computing
The Unreasonable Effectiveness of Data

IEEE Intelligent Systems
Language identification: the long and the short of the matter

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Quantified Score

Hi-index	0.00

Visualization

Abstract

Language identification of written text has been studied for several decades. Despite this fact, most of the research is focused on a few most spoken languages, whereas the minor ones are ignored. The identification of a larger number of languages brings new difficulties that do not occur for a few languages. These difficulties are causing decreased accuracy. The objective of this paper is to investigate the sources of such degradation. In order to isolate the impact of individual factors, 5 different algorithms and 3 different number of languages are used. The Support Vector Machine algorithm achieved an accuracy of 98% for 90 languages and the YALI algorithm based on a scoring function had an accuracy of 95.4%. The YALI algorithm has slightly lower accuracy but classifies around 17 times faster and its training is more than 4000 times faster. Three different data sets with various number of languages and sample sizes were prepared to overcome the lack of standardized data sets. These data sets are now publicly available.