We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py and provide an empirical comparison on five long-document datasets and two datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end users who need language identification but do not want to invest in preparing in-domain training data.