langid.py: an off-the-shelf language identification tool

  • Authors: Marco Lui; Timothy Baldwin
  • Affiliations: University of Melbourne, Australia; University of Melbourne, Australia
  • Venue: ACL '12 Proceedings of the ACL 2012 System Demonstrations
  • Year: 2012

Abstract

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py and provide an empirical comparison on five long-document datasets and two datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end users who require language identification but do not want to invest in preparing in-domain training data.