Semi-supervised SVMs for classification with unknown class proportions and a small labeled dataset

Authors:
Sathiya Keerthi Selvaraj;Bigyan Bhar;Sundararajan Sellamanickam;Shirish Shevade
Affiliations:
Yahoo, Santa Clara, CA, USA;Indian Institute of Science, Bangalore, India;Yahoo, Bangalore, India;Indian Institute of Science, Bangalore, India
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

In the design of practical web page classification systems, one often encounters a situation in which the labeled training set is created by choosing some examples from each class, but the class proportions in this set differ from those in the test distribution to which the classifier will actually be applied. The problem is made worse when the amount of training data is also small. In this paper we explore and adapt binary SVM methods that make use of unlabeled data from the test distribution, namely Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods, to deal with this situation. We empirically show that when the labeled training set is small, a TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training set is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; when this estimate is used, both TSVM and ER/EC give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.
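
The abstract does not say what the Gaussian mixture model is fitted to, so the following is only a minimal sketch of how a class-ratio estimate of this kind could be obtained. It assumes, purely for illustration, that each unlabeled test example has a one-dimensional real-valued classifier score, that a two-component mixture is fitted to those scores with plain EM, and that the component with the larger mean corresponds to the positive class. The function name `estimate_positive_fraction` and its parameters are hypothetical and do not come from the paper.

```python
import math


def estimate_positive_fraction(scores, n_iter=200, tol=1e-9):
    """Fit a two-component 1-D Gaussian mixture to scores with EM.

    Returns the mixing weight of the component with the larger mean,
    read here (by assumption) as the fraction of positive examples.
    """
    x = [float(s) for s in scores]
    n = len(x)
    mean = sum(x) / n
    var0 = sum((v - mean) ** 2 for v in x) / n + 1e-6

    # Start one component at each extreme of the score range (illustrative choice).
    mu = [min(x), max(x)]
    var = [var0, var0]
    pi = [0.5, 0.5]
    prev_ll = -math.inf

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each score.
        resp = []
        ll = 0.0
        for v in x:
            logp = [
                math.log(pi[k])
                - 0.5 * math.log(2.0 * math.pi * var[k])
                - 0.5 * (v - mu[k]) ** 2 / var[k]
                for k in range(2)
            ]
            m = max(logp)
            log_norm = m + math.log(sum(math.exp(lp - m) for lp in logp))
            ll += log_norm
            resp.append([math.exp(lp - log_norm) for lp in logp])

        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12  # guard against an empty component
            pi[k] = nk / n
            mu[k] = sum(r[k] * v for r, v in zip(resp, x)) / nk
            var[k] = sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, x)) / nk + 1e-6

        # Stop once the log-likelihood no longer improves.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return pi[1] if mu[1] >= mu[0] else pi[0]
```

An estimate obtained this way could then be supplied as the class ratio to a TSVM or ER/EC trainer. Whether the scores separate cleanly enough for the mixture to be meaningful depends on the quality of the underlying classifier, which is consistent with the abstract's remark that the approach works when the labeled set is sufficiently large.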