Tokenizing micro-blogging messages using a text classification approach

  • Authors:
  • Gustavo Laboreiro;Luís Sarmento;Jorge Teixeira;Eugénio Oliveira

  • Affiliations:
  • LIACC - Faculdade de Engenharia da Universidade do Porto, Porto, Portugal;Labs SAPO and LIACC - Faculdade de Engenharia da Universidade do Porto, Porto, Portugal;Labs SAPO and LIACC - Faculdade de Engenharia da Universidade do Porto, Porto, Portugal;LIACC - Faculdade de Engenharia da Universidade do Porto, Porto, Portugal

  • Venue:
  • AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data
  • Year:
  • 2010


Abstract

The automatic processing of microblogging messages can be problematic, even for very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. ".... ..", "!??!!!?", ",,,"). Additionally, spelling errors are abundant (e.g. "I;m"), and more than one language (with different tokenization requirements) can frequently be found in the same short message. To be effective in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and is relatively simple to set up and maintain. For that, we created a corpus of 2500 manually tokenized Twitter messages -- a task that is simple for human annotators -- and trained an SVM classifier to decide whether to separate tokens at certain discontinuity characters. For comparison, we created a baseline rule-based system designed specifically to deal with typical problematic situations. Results show that the classification-based approach achieves an F-measure of 96%, well above the performance of the baseline rule-based tokenizer (85%). Subsequent analysis also allowed us to identify typical tokenization errors, which we show can be partially solved by adding further descriptive examples to the training corpus and re-training the classifier.
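To make the classification-based idea concrete, below is a minimal sketch (not the authors' code) of how a tokenizer of this kind could be assembled with scikit-learn: an SVM decides, for every candidate split point adjacent to a "discontinuity" character (punctuation or symbols), whether a token boundary should be placed there. The feature set, the regular expression defining discontinuity characters, the corpus format (per-message sets of gold boundary offsets) and the use of `LinearSVC` are all assumptions made for illustration.

```python
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Candidate split characters: anything that is not alphanumeric or whitespace (assumption).
DISCONTINUITY = re.compile(r"[^\w\s]")

def candidate_points(text):
    """Yield character offsets immediately before and after each discontinuity character."""
    for m in DISCONTINUITY.finditer(text):
        yield m.start()
        yield m.end()

def features(text, i, window=3):
    """Character-window features around offset i (hypothetical feature set)."""
    left, right = text[max(0, i - window):i], text[i:i + window]
    return {
        "left": left,
        "right": right,
        "char_before": text[i - 1] if i > 0 else "<s>",
        "char_after": text[i] if i < len(text) else "</s>",
        "left_is_alnum": left[-1:].isalnum(),
        "right_is_alnum": right[:1].isalnum(),
    }

def train(messages, gold_boundaries):
    """messages: raw strings; gold_boundaries: sets of offsets where the
    manually tokenized corpus places a token boundary."""
    X, y = [], []
    for text, gold in zip(messages, gold_boundaries):
        for i in set(candidate_points(text)):
            X.append(features(text, i))
            y.append(1 if i in gold else 0)
    vec = DictVectorizer()
    clf = LinearSVC()
    clf.fit(vec.fit_transform(X), y)
    return vec, clf

def tokenize(text, vec, clf):
    """Split text at the candidate points the SVM labels as boundaries."""
    cuts = sorted(i for i in set(candidate_points(text))
                  if clf.predict(vec.transform([features(text, i)]))[0] == 1)
    pieces, last = [], 0
    for i in cuts:
        if i > last:
            pieces.append(text[last:i])
        last = i
    pieces.append(text[last:])
    # Whitespace always separates tokens; the SVM only adds further cuts.
    return [t for chunk in pieces for t in chunk.split() if t]
```

Usage would follow the paper's setup: fit on the manually tokenized corpus (`vec, clf = train(msgs, gold)`), then call `tokenize("loool!??!! I;m gr8 2day (ò_ó)", vec, clf)` on unseen messages, so that hard decisions such as keeping "tl;dr" intact while splitting "I;m" are learned from examples rather than hand-written rules.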