Improving tweet stream classification by detecting changes in word probability

Authors:
Kyosuke Nishida;Takahide Hoshide;Ko Fujimura
Affiliations:
NTT Service Evolution Laboratories, NTT Corporation, Kanagawa, Japan;NTT Service Evolution Laboratories, NTT Corporation, Kanagawa, Japan;Otsuma Women's University, Tokyo, Japan
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Abstract

We propose a classification model for tweet streams on Twitter, which are representative of document streams whose statistical properties change over time. Our model addresses several problems that hinder tweet classification, in particular that word occurrence probabilities change at different rates for different words. For each word, the model switches between two probability estimates, one based on the full data and one based on recent data, when it detects a change in that word's probability. This switching lets the model both learn stationary words accurately and respond quickly to bursty words. We then explain how to implement the model using a word suffix array, a full-text search index, which lets the model handle the temporal attributes of word n-grams effectively. Experiments on three tweet data sets show that our model achieves statistically significantly higher topic-classification accuracy than conventional temporally-aware classification models.
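
The switching idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how changes are detected or how the estimates are smoothed, so everything here is an assumption made for illustration: the class name `SwitchingWordEstimator`, the fixed-size recent window, the z-score change test, and the add-one smoothing are hypothetical and are not the authors' method. The sketch also works on single words only and uses plain counters, not a word suffix array.

```python
import math
from collections import Counter, deque


class SwitchingWordEstimator:
    """Per-word probability estimate that switches between full and recent data.

    Hypothetical sketch: the window size, the z-score change test, and the
    add-one smoothing are assumptions, not taken from the paper.
    """

    def __init__(self, window=1000, z_threshold=3.0):
        self.window = window            # number of most recent tokens kept
        self.z_threshold = z_threshold  # deviation that counts as a change
        self.full_counts = Counter()    # counts over all tokens seen so far
        self.recent_counts = Counter()  # counts over the recent window only
        self.recent_tokens = deque()    # the recent window itself
        self.total = 0                  # number of tokens seen so far

    def update(self, tokens):
        """Add tokens to both the full counts and the sliding recent window."""
        for tok in tokens:
            self.full_counts[tok] += 1
            self.total += 1
            self.recent_tokens.append(tok)
            self.recent_counts[tok] += 1
            if len(self.recent_tokens) > self.window:
                old = self.recent_tokens.popleft()
                self.recent_counts[old] -= 1

    def _smoothed(self, count, total):
        # Add-one smoothing over the vocabulary observed so far (assumption).
        vocab = max(len(self.full_counts), 1)
        return (count + 1) / (total + vocab)

    def changed(self, word):
        """Flag a change when the recent rate deviates from the full-data rate."""
        n = len(self.recent_tokens)
        if n == 0 or self.total == 0:
            return False
        p_full = self.full_counts[word] / self.total
        p_recent = self.recent_counts[word] / n
        # Standard error under the full-data rate; floored to avoid dividing by zero.
        se = math.sqrt(max(p_full * (1 - p_full), 1e-12) / n)
        return abs(p_recent - p_full) / se > self.z_threshold

    def prob(self, word):
        """Use the recent estimate for changed words, the full estimate otherwise."""
        if self.changed(word):
            return self._smoothed(self.recent_counts[word], len(self.recent_tokens))
        return self._smoothed(self.full_counts[word], self.total)


est = SwitchingWordEstimator(window=100)
est.update(["the", "a"] * 500)                  # long stationary history
est.update(["quake"] * 10 + ["the", "a"] * 45)  # a bursty word appears
print(est.changed("quake"), est.changed("the"))
print(est.prob("quake"), est.prob("the"))
```

In this toy stream the bursty word is estimated from the recent window, so its probability rises quickly, while the stationary word keeps the lower-variance estimate built from all the data. That is the trade-off the abstract describes, shown under the stated assumptions.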