Language identification for creating language-specific Twitter collections

  • Authors:
  • Shane Bergsma; Paul McNamee; Mossaab Bagdouri; Clayton Fink; Theresa Wilson

  • Affiliations:
  • Johns Hopkins University; Johns Hopkins University and Johns Hopkins University Applied Physics Laboratory, Laurel, MD; University of Maryland, College Park, MD; Johns Hopkins University Applied Physics Laboratory, Laurel, MD; Johns Hopkins University

  • Venue:
  • LSM '12 Proceedings of the Second Workshop on Language in Social Media
  • Year:
  • 2012

Abstract

Social media services such as Twitter offer an immense volume of real-world linguistic data. We explore the use of Twitter to obtain authentic user-generated text in low-resource languages such as Nepali, Urdu, and Ukrainian. Automatic language identification (LID) can be used to extract language-specific data from Twitter, but it is unclear how well LID performs on short, informal texts in low-resource languages. We address this question by annotating and releasing a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts. This is the first publicly-available collection of LID-annotated tweets in non-Latin scripts, and should become a standard evaluation set for LID systems. We also advance the state-of-the-art by evaluating new, highly-accurate LID systems, trained both on our new corpus and on standard materials only. Both types of systems achieve a substantial performance improvement over the existing state-of-the-art, correctly classifying around 98% of our gold standard tweets. We provide a detailed analysis showing how the accuracy of our systems varies along certain dimensions, such as tweet length and the amount of in- and out-of-domain training data.
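
The abstract does not spell out the classifiers themselves. The sketch below illustrates one common approach to short-text LID, a character n-gram Naive Bayes model; it is purely illustrative and is not the authors' system. The class name NgramLID and the toy training snippets are hypothetical, not drawn from the released corpus.

    # Minimal sketch of character n-gram language identification
    # (multinomial Naive Bayes with add-one smoothing); illustrative only.
    import math
    from collections import Counter, defaultdict

    def char_ngrams(text, n=3):
        """Overlapping character n-grams from a space-padded, lowercased tweet."""
        padded = f" {text.strip().lower()} "
        return [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]

    class NgramLID:
        def __init__(self, n=3):
            self.n = n
            self.counts = defaultdict(Counter)   # language -> n-gram counts
            self.totals = Counter()              # language -> total n-grams seen
            self.vocab = set()

        def train(self, text, lang):
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)
            self.vocab.update(grams)

        def predict(self, text):
            grams = char_ngrams(text, self.n)
            v = len(self.vocab)
            best_lang, best_score = None, float("-inf")
            for lang in self.counts:
                # Add-one smoothed log-likelihood of the tweet under this language
                score = sum(
                    math.log((self.counts[lang][g] + 1) / (self.totals[lang] + v))
                    for g in grams
                )
                if score > best_score:
                    best_lang, best_score = lang, score
            return best_lang

    # Toy usage with hypothetical training snippets (not the released corpus):
    lid = NgramLID(n=3)
    lid.train("Це приклад українського тексту", "ukrainian")
    lid.train("Это пример русского текста", "russian")
    print(lid.predict("приклад твіту"))  # expected: "ukrainian"

Character n-grams are a natural fit for the confusable, closely related languages studied here, since script and subword patterns carry most of the signal in tweets that are too short for reliable word-level statistics.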