Automatic web search query generation to create minority language corpora
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Text Categorization Using Compression Models
DCC '00 Proceedings of the Conference on Data Compression
Language identification: a solved problem suitable for undergraduate instruction
Journal of Computing Sciences in Colleges
Determining an author's native language by mining a text for errors
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Spam Filtering Using Statistical Data Compression Models
The Journal of Machine Learning Research
LIBLINEAR: A Library for Large Linear Classification
The Journal of Machine Learning Research
Language ID in the context of harvesting language data off the web
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Compression and stylometry for author identification
IJCNN'09 Proceedings of the 2009 international joint conference on Neural Networks
Unsupervised modeling of Twitter conversations
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Language identification: the long and the short of the matter
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Language identification of names with SVMs
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Creating speech and language data with Amazon's Mechanical Turk
CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
Using Mechanical Turk to annotate lexicons for less commonly used languages
CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
Crowdsourcing translation: professional quality from non-professionals
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Discovering sociolinguistic associations with structured sparsity
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
A comparison of language identification approaches on short, query-style texts
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Data-driven response generation in social media
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Discriminating gender on Twitter
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Mapping the twitterverse in the developing world: an analysis of social media use in nigeria
SBP'12 Proceedings of the 5th international conference on Social Computing, Behavioral-Cultural Modeling and Prediction
Microblog language identification: overcoming the limitations of short, unedited and idiomatic text
Language Resources and Evaluation
Hi-index | 0.00 |
Social media services such as Twitter offer an immense volume of real-world linguistic data. We explore the use of Twitter to obtain authentic user-generated text in low-resource languages such as Nepali, Urdu, and Ukrainian. Automatic language identification (LID) can be used to extract language-specific data from Twitter, but it is unclear how well LID performs on short, informal texts in low-resource languages. We address this question by annotating and releasing a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts. This is the first publicly-available collection of LID-annotated tweets in non-Latin scripts, and should become a standard evaluation set for LID systems. We also advance the state-of-the-art by evaluating new, highly-accurate LID systems, trained both on our new corpus and on standard materials only. Both types of systems achieve a huge performance improvement over the existing state-of-the-art, correctly classifying around 98% of our gold standard tweets. We provide a detailed analysis showing how the accuracy of our systems vary along certain dimensions, such as the tweet-length and the amount of in- and out-of-domain training data.