Determining language variant in microblog messages

Authors:
Gustavo Laboreiro;Matko Bošnjak;Luís Sarmento;Eduarda Mendes Rodrigues;Eugénio Oliveira
Affiliations:
Universidade do Porto;Universidade do Porto;Universidade do Porto;Universidade do Porto;Universidade do Porto
Venue:
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Year:
2013

Abstract

Determining the country of origin of a short message's author from the text alone is difficult, and harder still when several countries share the same native language. In this paper, we address the specific problem of distinguishing the two main variants of Portuguese, European and Brazilian, in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier, achieving 95% accuracy. We find that our system is adequate for real-time tweet classification.
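
The abstract does not list the feature set, so the following is only a minimal sketch of how a Naïve Bayes classifier might separate the two variants. It assumes plain word counts as features and Laplace smoothing. The class name `NaiveBayes`, the labels `pt-PT` and `pt-BR`, and the six training sentences are all hypothetical, invented here for illustration, and are not the paper's features or data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase word tokens; \w is Unicode-aware in Python 3, so accented letters are kept.
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts (illustrative, not the paper's system)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # Laplace smoothing constant (assumed value)
        self.word_counts = {}         # label -> Counter of token frequencies
        self.doc_counts = Counter()   # label -> number of training messages
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for tok in tokenize(text):
                counts[tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, text):
        total_docs = sum(self.doc_counts.values())
        # Tokens never seen in training carry no evidence here, so they are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data (invented): well-known lexical differences between the variants.
TRAIN = [
    ("estou a ver o jogo no telemóvel", "pt-PT"),
    ("vou apanhar o autocarro para o Porto", "pt-PT"),
    ("o comboio está atrasado outra vez", "pt-PT"),
    ("estou vendo o jogo no celular", "pt-BR"),
    ("vou pegar o ônibus para o centro", "pt-BR"),
    ("o trem está atrasado de novo", "pt-BR"),
]

model = NaiveBayes().fit(*zip(*TRAIN))
print(model.predict("o comboio chegou"))
print(model.predict("o trem chegou"))
```

Scoring a message costs one pass over its tokens per class, which is consistent with the abstract's remark about real-time classification. A realistic system would need far more training data and features chosen for high precision.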