Parallel and Distributed Computation: Numerical Methods.
Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML '99).
Learning with progressive transductive support vector machine. Pattern Recognition Letters.
Training TSVM with the proper number of positive samples. Pattern Recognition Letters.
Large scale semi-supervised linear SVMs. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06).
Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning.
Optimization Techniques for Semi-Supervised Support Vector Machines. Journal of Machine Learning Research.
Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data. Journal of Machine Learning Research.
Semi-Supervised Learning.
In the design of practical web page classification systems, one often encounters a situation in which the labeled training set is created by choosing some examples from each class, but the class proportions in this set are not the same as those in the test distribution to which the classifier will actually be applied. The problem is made worse when the amount of training data is also small. In this paper we explore and adapt binary SVM methods that make use of unlabeled data from the test distribution, namely Transductive SVMs (TSVMs) and expectation regularization/constraint (ER/EC) methods, to deal with this situation. We empirically show that when the labeled training data is small, a TSVM designed using the class ratio tuned by minimizing the loss on the labeled set yields the best performance; its performance is good even when the deviation between the class ratios of the labeled training set and the test set is quite large. When the labeled training data is sufficiently large, an unsupervised Gaussian mixture model can be used to get a very good estimate of the class ratio in the test set; when this estimate is used, both TSVM and ER/EC give their best possible performance, with TSVM coming out superior. The ideas in the paper can be easily extended to multi-class SVMs and MaxEnt models.
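The class-ratio estimation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it fits an unsupervised two-component Gaussian mixture to unlabeled data drawn from the test distribution and reads off the mixing weights as an estimate of the binary class proportions. The synthetic data, the 70/30 ratio, and the use of scikit-learn's `GaussianMixture` are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical unlabeled sample from the test distribution: two
# well-separated Gaussian blobs mixed in a 70/30 ratio (the ratio and
# geometry are chosen here purely for illustration).
rng = np.random.default_rng(0)
X_unlabeled = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(700, 2)),  # majority class
    rng.normal(loc=+2.0, scale=1.0, size=(300, 2)),  # minority class
])

# Fit an unsupervised two-component Gaussian mixture; its mixing
# weights serve as an estimate of the binary class proportions,
# which can then be supplied to a TSVM or ER/EC trainer.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_unlabeled)
ratio_estimate = sorted(gmm.weights_)
print(ratio_estimate)  # close to [0.3, 0.7] for this synthetic data
```

In practice the mixture would be fit in the feature space used by the classifier (or a low-dimensional projection of it), and the recovered ratio is only reliable when the two components are reasonably separated, as the paper's large-labeled-set regime presumes.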