Multilingual posts can affect the outcomes of content analysis on microblog platforms. Language identification addresses this by providing a monolingual set of content for analysis. We find the unedited and idiomatic language of microblogs to be challenging for state-of-the-art language identification methods. To account for this, we identify five microblog characteristics that can serve as priors for language identification: the language profile of the blogger (blogger), the content of an attached hyperlink (link), the language profile of other users mentioned in the post (mention), the language profile of a tag (tag), and the language of the original post (conversation), if the post we examine is a reply. Further, we present methods that combine these priors in a post-dependent and a post-independent way. We present test results on 1,000 posts from five languages (Dutch, English, French, German, and Spanish), which show that our priors improve accuracy by 5% over a domain-specific baseline and that post-dependent combination of the priors achieves the best performance. When suitable training data does not exist, our methods still outperform a domain-unspecific baseline. We conclude with an examination of the language distribution of a million tweets, along with a temporal analysis, the usage of Twitter features across languages, and a correlation study between the classifications made and the geo-location and language metadata fields.
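To make the distinction between post-independent and post-dependent combination concrete, the following is a minimal illustrative sketch, not the method evaluated in the paper. It assumes that the content-based classifier and each available prior produce a probability distribution over the five languages. The function names, the fixed weights, and the confidence measure (the gap between the two highest probabilities) are all hypothetical choices made for illustration only.

```python
# Illustrative sketch only: names, weights, and the confidence measure
# are assumptions, not taken from the paper.
LANGS = ("nl", "en", "fr", "de", "es")


def normalize(scores):
    """Rescale scores so they sum to 1; fall back to uniform if all are zero."""
    total = sum(scores.values())
    if total == 0:
        return {lang: 1.0 / len(scores) for lang in scores}
    return {lang: value / total for lang, value in scores.items()}


def weighted_sum(sources, weights):
    """Linearly interpolate the distributions in `sources` using `weights`."""
    combined = {lang: 0.0 for lang in LANGS}
    for name, dist in sources.items():
        for lang in LANGS:
            combined[lang] += weights.get(name, 0.0) * dist.get(lang, 0.0)
    return normalize(combined)


def combine_post_independent(content, priors, weights):
    """Post-independent: the same fixed weights are used for every post."""
    return weighted_sum({"content": content, **priors}, weights)


def confidence(dist):
    """Assumed confidence measure: gap between the two highest probabilities."""
    top = sorted((dist.get(lang, 0.0) for lang in LANGS), reverse=True)
    return top[0] - top[1]


def combine_post_dependent(content, priors):
    """Post-dependent: weights are derived from this post's own distributions."""
    sources = {"content": content, **priors}
    weights = {name: confidence(dist) for name, dist in sources.items()}
    return weighted_sum(sources, weights)


def best_language(dist):
    """Return the language with the highest combined score."""
    return max(dist, key=dist.get)


# Hypothetical post: the content classifier narrowly prefers English,
# but the blogger's language profile is strongly Dutch.
content = {"nl": 0.40, "en": 0.45, "fr": 0.05, "de": 0.05, "es": 0.05}
priors = {"blogger": {"nl": 0.90, "en": 0.10, "fr": 0.0, "de": 0.0, "es": 0.0}}
fixed_weights = {"content": 0.5, "blogger": 0.5}

independent = combine_post_independent(content, priors, fixed_weights)
dependent = combine_post_dependent(content, priors)
```

In this made-up example the content classifier alone would pick English, while both combinations pick Dutch. The post-dependent variant gives the blogger prior more weight because its distribution is more peaked for this particular post; a post without a given prior (for example, no link or no tag) simply omits it from `priors`.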