Tweet classification by data compression

  • Authors:
  • Kyosuke Nishida; Ryohei Banno; Ko Fujimura; Takahide Hoshide

  • Affiliations:
  • NTT Corporation, Yokosuka, Kanagawa, Japan; Hokkaido University, Sapporo, Japan; NTT Corporation, Yokosuka, Kanagawa, Japan; NTT Corporation, Yokosuka, Kanagawa, Japan

  • Venue:
  • Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web
  • Year:
  • 2011

Abstract

We propose a new method that uses data compression to classify an unseen tweet as related or unrelated to a topic of interest. Our compression-based tweet classification method, called CTC, evaluates how well the tweet compresses when given positive and negative examples. This lets the method handle multilingual tweets uniformly and make effective use of the tweet's word context, which is extremely important information under the 140-character limit. Experiments with worldwide tweets assigned a single hashtag demonstrate that our method, evaluated empirically with the Deflate algorithm (used in gzip), achieved higher precision and recall than state-of-the-art online learning algorithms.
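
The general idea of classifying by compressibility can be illustrated with a minimal Python sketch. This is not the authors' CTC implementation: the abstract does not give the scoring rule, so the sketch assumes one common formulation, in which a tweet is assigned to whichever example set lets it be compressed into fewer additional bytes. The function names (`compressed_size`, `extra_bytes`, `is_on_topic`) and the sample tweets are hypothetical; only the use of Deflate, here through Python's standard `zlib` module, comes from the abstract.

```python
import zlib
from typing import Sequence


def compressed_size(data: bytes) -> int:
    """Return the Deflate-compressed size of data in bytes (zlib, level 9)."""
    return len(zlib.compress(data, 9))


def extra_bytes(examples: Sequence[str], tweet: str) -> int:
    """Estimate how many extra compressed bytes the tweet costs when it is
    appended to the given examples. A smaller value means the tweet shares
    more byte sequences with the examples."""
    corpus = "\n".join(examples).encode("utf-8")
    combined = corpus + b"\n" + tweet.encode("utf-8")
    return compressed_size(combined) - compressed_size(corpus)


def is_on_topic(tweet: str, positives: Sequence[str], negatives: Sequence[str]) -> bool:
    """Assumed decision rule (not taken from the paper): label the tweet as
    on-topic if it compresses better against the positive examples."""
    return extra_bytes(positives, tweet) < extra_bytes(negatives, tweet)


# Hypothetical example data, for illustration only.
positives = [
    "Magnitude 6 earthquake shakes northern Japan #earthquake",
    "Aftershocks continue after the earthquake near Sendai #earthquake",
]
negatives = [
    "Best ramen I have had all year, highly recommend",
    "New phone arrived today and the camera is great",
]

label = is_on_topic("Aftershocks continue after the earthquake near Sendai", positives, negatives)
```

Because the sketch operates on UTF-8 bytes rather than on tokenized words, it needs no language-specific preprocessing, which matches the abstract's point about handling multilingual tweets in the same manner. Note that Deflate only finds matches within a 32 KB sliding window, so with this naive concatenation only the most recent examples influence the result.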