Determining the country of origin of a short message's author from the text alone is difficult, and it is harder still when more than one country uses the same native language. In this paper, we address the specific problem of distinguishing the two main variants of the Portuguese language, European and Brazilian, in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier and achieve 95% accuracy. We find that our system is adequate for real-time tweet classification.
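To make the classification approach concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with Laplace smoothing applied to variant detection. It is not the system evaluated in the paper: the abstract does not list the features, so plain word tokens stand in for them here, and the class `NaiveBayes`, the helper `tokenize`, the labels `"BR"` and `"PT"`, and the four training sentences are all hypothetical, chosen only to show well-known lexical differences (for example "ônibus" versus "autocarro").

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word tokens with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.class_counts = Counter()            # documents seen per label
        self.token_counts = defaultdict(Counter) # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, text):
        """Return log prior + summed log likelihoods for each label."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, hand-written training data (an assumption for illustration only).
TRAIN = [
    ("perdi o ônibus e fiquei sem celular", "BR"),
    ("você está pegando o trem agora", "BR"),
    ("perdi o autocarro e fiquei sem telemóvel", "PT"),
    ("estás a apanhar o comboio agora", "PT"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])

if __name__ == "__main__":
    for tweet in ("o meu telemóvel ficou no autocarro", "meu celular caiu no ônibus"):
        print(tweet, "->", model.predict(tweet))
```

Scoring a message costs one pass over its tokens per class, which is consistent with the real-time use the abstract mentions, although throughput in practice would depend on the actual feature extraction.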