Language identification for creating language-specific Twitter collections

  • Authors:
  • Shane Bergsma; Paul McNamee; Mossaab Bagdouri; Clayton Fink; Theresa Wilson

  • Affiliations:
  • Johns Hopkins University; Johns Hopkins University and Johns Hopkins University Applied Physics Laboratory, Laurel, MD; University of Maryland, College Park, MD; Johns Hopkins University Applied Physics Laboratory, Laurel, MD; Johns Hopkins University

  • Venue:
  • LSM '12 Proceedings of the Second Workshop on Language in Social Media
  • Year:
  • 2012

Abstract

Social media services such as Twitter offer an immense volume of real-world linguistic data. We explore the use of Twitter to obtain authentic user-generated text in low-resource languages such as Nepali, Urdu, and Ukrainian. Automatic language identification (LID) can be used to extract language-specific data from Twitter, but it is unclear how well LID performs on short, informal texts in low-resource languages. We address this question by annotating and releasing a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts. This is the first publicly-available collection of LID-annotated tweets in non-Latin scripts, and should become a standard evaluation set for LID systems. We also advance the state-of-the-art by evaluating new, highly-accurate LID systems, trained both on our new corpus and on standard materials only. Both types of systems achieve a substantial performance improvement over the existing state-of-the-art, correctly classifying around 98% of our gold standard tweets. We provide a detailed analysis showing how the accuracy of our systems varies along certain dimensions, such as tweet length and the amount of in- and out-of-domain training data.
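
The abstract does not spell out the classifiers themselves. The sketch below illustrates one common approach to short-text LID, a character n-gram Naive Bayes model; it is purely illustrative and is not the authors' system. The class name NgramLID and the toy training snippets are hypothetical, not drawn from the released corpus.

    # Minimal sketch of character n-gram language identification
    # (multinomial Naive Bayes with add-one smoothing); illustrative only.
    import math
    from collections import Counter, defaultdict

    def char_ngrams(text, n=3):
        """Overlapping character n-grams from a space-padded, lowercased tweet."""
        padded = f" {text.strip().lower()} "
        return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]

    class NgramLID:
        def __init__(self, n=3):
            self.n = n
            self.counts = defaultdict(Counter)   # language -> n-gram counts
            self.totals = Counter()              # language -> total n-grams seen
            self.vocab = set()

        def train(self, text, lang):
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)
            self.vocab.update(grams)

        def predict(self, text):
            grams = char_ngrams(text, self.n)
            v = len(self.vocab)
            best_lang, best_score = None, float("-inf")
            for lang in self.counts:
                # Add-one smoothed log-likelihood of the tweet under this language
                score = sum(
                    math.log((self.counts[lang][g] + 1) / (self.totals[lang] + v))
                    for g in grams
                )
                if score > best_score:
                    best_lang, best_score = lang, score
            return best_lang

    # Toy usage with hypothetical training snippets (not the released corpus):
    lid = NgramLID(n=3)
    lid.train("Це приклад українського тексту", "ukrainian")
    lid.train("Это пример русского текста", "russian")
    print(lid.predict("приклад твіту"))  # expected: "ukrainian"

Character n-grams are a natural fit for the confusable, closely related languages studied here, since script and subword patterns carry most of the signal in tweets that are too short for reliable word-level statistics.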