Microblog language identification: overcoming the limitations of short, unedited and idiomatic text

Authors:
Simon Carter;Wouter Weerkamp;Manos Tsagkias
Affiliations:
ISLA, University of Amsterdam, Amsterdam, The Netherlands 1098 XH;ISLA, University of Amsterdam, Amsterdam, The Netherlands 1098 XH;ISLA, University of Amsterdam, Amsterdam, The Netherlands 1098 XH
Venue:
Language Resources and Evaluation
Year:
2013

Abstract

Multilingual posts can affect the outcomes of content analysis on microblog platforms. Language identification addresses this by providing a monolingual set of content for analysis. We find the unedited and idiomatic language of microblogs to be challenging for state-of-the-art language identification methods. To account for this, we identify five microblog characteristics that can help in language identification: the language profile of the blogger (blogger), the content of an attached hyperlink (link), the language profile of other users mentioned in the post (mention), the language profile of a tag (tag), and, if the post is a reply, the language of the original post (conversation). Further, we present methods that combine these priors in a post-dependent and a post-independent way. We present test results on 1,000 posts from five languages (Dutch, English, French, German, and Spanish), which show that our priors improve accuracy by 5% over a domain-specific baseline and that post-dependent combination of the priors achieves the best performance. When suitable training data does not exist, our methods still outperform a domain-unspecific baseline. We conclude with an examination of the language distribution of a million tweets, along with temporal analysis, the usage of Twitter features across languages, and a correlation study between the classifications made and the geo-location and language metadata fields.
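
The difference between a post-independent and a post-dependent combination of priors can be illustrated with a minimal sketch. The abstract does not specify how the priors are combined, so everything below is an assumption made for illustration: the weighted linear mixture, the use of each source's peak probability as a per-post confidence, the function names, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the combination rule, names and numbers are
# assumptions, not the method described in the paper.

LANGUAGES = ("nl", "en", "fr", "de", "es")


def normalize(scores):
    """Rescale scores over LANGUAGES so they sum to 1 (uniform if all zero)."""
    total = sum(scores.get(lang, 0.0) for lang in LANGUAGES)
    if total <= 0:
        return {lang: 1.0 / len(LANGUAGES) for lang in LANGUAGES}
    return {lang: scores.get(lang, 0.0) / total for lang in LANGUAGES}


def combine(sources, weights):
    """Weighted linear mixture of per-source language distributions."""
    combined = {lang: 0.0 for lang in LANGUAGES}
    for name, dist in sources.items():
        dist = normalize(dist)
        weight = weights.get(name, 0.0)  # sources without a weight are ignored
        for lang in LANGUAGES:
            combined[lang] += weight * dist[lang]
    return normalize(combined)


def confidence_weights(sources):
    """Post-dependent weights: each source is weighted by how peaked its
    distribution is for this particular post (assumed heuristic)."""
    return {name: max(normalize(dist).values()) for name, dist in sources.items()}


def identify(sources, weights):
    """Return the language with the highest combined probability."""
    combined = combine(sources, weights)
    return max(combined, key=combined.get)


# Toy post: the text alone is ambiguous between Dutch and German,
# while the blogger's (hypothetical) language profile is mostly Dutch.
sources = {
    "content": {"nl": 0.40, "en": 0.05, "fr": 0.05, "de": 0.45, "es": 0.05},
    "blogger": {"nl": 0.80, "en": 0.10, "fr": 0.00, "de": 0.10, "es": 0.00},
}

# Post-independent: the same fixed weights are used for every post.
fixed = {"content": 0.6, "blogger": 0.4}
print(identify(sources, fixed))

# Post-dependent: weights are recomputed from the sources of each post.
print(identify(sources, confidence_weights(sources)))
```

In this toy setting the content distribution alone would favour German, whereas adding the blogger prior shifts the decision to Dutch under both weighting schemes. The fixed weights would in practice be tuned on held-out data, which is again an assumption rather than something the abstract states.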