Language identification is the task of identifying the language in which a given document is written. This paper presents a detailed examination of which models perform best under different conditions, based on experiments across three separate datasets and a range of tokenisation strategies. We demonstrate that the task becomes increasingly difficult as we increase the number of languages, reduce the amount of training data, and reduce the length of documents. We also show that it is possible to perform language identification without explicit character encoding detection.
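To make the last point concrete, here is a minimal sketch of one way language identification can work directly on raw bytes, so that no encoding detection step is needed. It is an illustration only, not the models evaluated in the paper: the byte-bigram tokenisation, the cosine-similarity nearest-profile rule, the toy training sentences, and the names `byte_ngrams`, `train` and `identify` are all assumptions made for this example.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams; the input is never decoded."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one aggregate byte n-gram profile per language label."""
    profiles = {}
    for lang, data in samples:
        profiles.setdefault(lang, Counter()).update(byte_ngrams(data))
    return profiles


def identify(profiles, data: bytes) -> str:
    """Return the label whose profile is most similar to the document."""
    query = byte_ngrams(data)
    return max(profiles, key=lambda lang: cosine(profiles[lang], query))


# Toy training data (hypothetical). The German sentence is supplied in two
# encodings, so the profile covers both byte representations of the umlauts.
EN = "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun"
DE = "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne"

profiles = train([
    ("en", EN.encode("utf-8")),
    ("de", DE.encode("latin-1")),
    ("de", DE.encode("utf-8")),
])

print(identify(profiles, "the dog sleeps in the sun".encode("utf-8")))
print(identify(profiles, "der hund schläft in der sonne".encode("latin-1")))
print(identify(profiles, "der hund schläft in der sonne".encode("utf-8")))
```

Because the features are byte sequences, the same text in a different encoding simply contributes different n-grams to its language's profile, and the classifier never needs to know which encoding a document uses. With training data this small the result is fragile, which is in keeping with the abstract's observation that less training data and shorter documents make the task harder.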