Tweet classification by data compression

Authors:
Kyosuke Nishida; Ryohei Banno; Ko Fujimura; Takahide Hoshide
Affiliations:
NTT Corporation, Yokosuka, Kanagawa, Japan; Hokkaido University, Sapporo, Japan; NTT Corporation, Yokosuka, Kanagawa, Japan; NTT Corporation, Yokosuka, Kanagawa, Japan
Venue:
Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web
Year:
2011

Abstract

We propose a new method that uses data compression to classify an unseen tweet as related or unrelated to a topic of interest. Our compression-based tweet classification method, called CTC, evaluates how well the tweet compresses when given positive and negative examples. This lets the method handle multilingual tweets uniformly and exploit the word context of a tweet, which is especially important under the 140-character limit. In experiments on worldwide tweets, each assigned a single hashtag, our method, which uses the Deflate algorithm (the one used in gzip) for the empirical evaluation, achieved higher precision and recall than state-of-the-art online learning algorithms.
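
The idea of scoring a tweet by its compressibility against positive and negative examples can be illustrated with a minimal Python sketch. This is not the authors' CTC implementation: the abstract does not give the scoring rule, so the decision rule below (compare the extra compressed bytes the tweet costs when appended to each example set) is an assumption, and the names `compressed_size`, `extra_cost`, and `classify` are hypothetical. It uses Python's standard `zlib` module, which implements Deflate.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after Deflate compression."""
    return len(zlib.compress(data, 9))


def extra_cost(examples: list[str], tweet: str) -> int:
    """Extra compressed bytes needed to append `tweet` to `examples`.

    Assumption: this difference stands in for "compressibility of the
    tweet given the examples"; the paper may define it differently.
    """
    corpus = "\n".join(examples).encode("utf-8")
    with_tweet = corpus + b"\n" + tweet.encode("utf-8")
    return compressed_size(with_tweet) - compressed_size(corpus)


def classify(tweet: str, positives: list[str], negatives: list[str]) -> bool:
    """Label the tweet positive if it is cheaper to encode after the
    positive examples than after the negative ones (assumed rule)."""
    return extra_cost(positives, tweet) < extra_cost(negatives, tweet)


if __name__ == "__main__":
    # Made-up example tweets, for illustration only.
    positives = [
        "huge earthquake shakes tokyo station right now",
        "earthquake alert issued for the kanto region",
    ]
    negatives = [
        "new ramen shop opened near my office today",
        "watching the baseball game with friends tonight",
    ]
    print(classify("strong earthquake shakes tokyo again", positives, negatives))
```

Because the sketch works on UTF-8 bytes, it needs no tokenizer or language-specific preprocessing, which matches the abstract's point about handling multilingual tweets in the same manner. Note that Deflate only finds repeats within a 32 KB window, so a real system would have to decide which examples to keep within reach of the tweet.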