langid.py: an off-the-shelf language identification tool

Authors:
Marco Lui; Timothy Baldwin
Affiliations:
University of Melbourne, Australia; University of Melbourne, Australia
Venue:
ACL '12 Proceedings of the ACL 2012 System Demonstrations
Year:
2012

Abstract

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it well suited to end users who require language identification without investing in the preparation of in-domain training data.