LA-LDA: a limited attention topic model for social recommendation
SBP'13 Proceedings of the 6th international conference on Social Computing, Behavioral-Cultural Modeling and Prediction
The rapid growth of social data in the form of videos, microblog posts, and other items shared on social media presents new opportunities for learning user behavior and preferences. Bayesian models have been widely used for modeling social data, since they capture uncertainty and prior knowledge, avoid overfitting, and can be easily extended to incorporate new types of data. Researchers have used a variety of inference procedures to learn model parameters from data. In particular, the Stochastic Gradient Fisher Scoring (SGFS) method was recently proposed for efficient inference. This method samples from a Bayesian posterior using a small number of data samples in each iteration, instead of the entire data set, to speed up inference. In this paper we explore the feasibility of SGFS for social data mining. We find that SGFS often outperforms other inference methods on dense data, but it fails in the sparse "long tail", where there are not enough instances for it to learn parameters. This is problematic, because social data often has a long-tailed distribution. To address this problem, we propose hybrid SGFS (hSGFS) and evaluate its performance on a variety of social data sets. We find that hSGFS is better able to predict held-out items in data sets that have a long-tailed distribution.
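The core idea behind SGFS-style samplers is to replace the full-data gradient of the log posterior with an unbiased mini-batch estimate, plus injected noise so the iterates sample from the posterior rather than merely optimize it. The following is a minimal sketch of that mini-batch idea on a toy model (a Gaussian mean with a Gaussian prior); for simplicity it uses the plain stochastic gradient Langevin dynamics update, whereas full SGFS additionally preconditions the step with an estimate of the Fisher information. All variable names and constants here are illustrative, not from the paper.

```python
import numpy as np

# Toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 10).
# We sample theta from its posterior using mini-batch noisy gradient steps.
rng = np.random.default_rng(0)
N = 10_000
data = rng.normal(2.0, 1.0, size=N)  # synthetic data with true mean 2.0

def grad_log_posterior(theta, batch):
    """Unbiased mini-batch estimate of d/dtheta log p(theta | data)."""
    grad_prior = -theta / 10.0                            # from log N(theta; 0, 10)
    grad_lik = (N / len(batch)) * np.sum(batch - theta)   # rescaled likelihood term
    return grad_prior + grad_lik

theta, eps, batch_size = 0.0, 1e-5, 100
samples = []
for t in range(2_000):
    batch = rng.choice(data, size=batch_size, replace=False)
    noise = rng.normal(0.0, np.sqrt(eps))  # injected noise turns SGD into a sampler
    theta = theta + 0.5 * eps * grad_log_posterior(theta, batch) + noise
    if t >= 500:                           # discard burn-in iterations
        samples.append(theta)

posterior_mean = np.mean(samples)  # close to the sample mean of the data
```

The paper's observation about the long tail can be read off this sketch: the mini-batch gradient is only a useful estimate when each entity appears in enough batches; for tail items with very few observations, the noisy gradient signal is too weak to learn their parameters, which motivates a hybrid scheme like hSGFS.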