There has been increasing interest in learning techniques that combine unlabeled data with labeled data, i.e., semi-supervised learning. However, to the best of our knowledge, no study has compared such techniques across different types and amounts of labeled and unlabeled data. Moreover, most published work on semi-supervised learning assumes that the labeled and unlabeled data come from the same distribution. In practice, the labeling process may introduce a selection bias, so that the distributions of data points in the labeled and unlabeled sets differ. Failing to correct for such bias can yield a biased function approximation with potentially poor performance. In this paper, we present an empirical study of several semi-supervised learning techniques on a variety of datasets. We investigate questions such as the effect of independence or relevance among features, the effect of the sizes of the labeled and unlabeled sets, and the effect of noise. We also examine the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique specifically designed to correct for such bias.
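The semi-supervised setting described above can be illustrated with a minimal self-training sketch. This is a hypothetical example, not one of the specific techniques evaluated in the paper: a nearest-centroid classifier repeatedly assigns a label to the unlabeled point it is most confident about (the one closest to any class centroid) and retrains on the enlarged labeled set.

```python
# Minimal self-training sketch (illustrative only): a nearest-centroid
# classifier iteratively absorbs its most confidently labeled unlabeled point.

def centroid(points):
    """Mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(labeled, unlabeled):
    """labeled: dict class -> list of points; unlabeled: list of points.
    Returns the labeled dict after all unlabeled points have been absorbed."""
    labeled = {c: list(ps) for c, ps in labeled.items()}
    pool = list(unlabeled)
    while pool:
        cents = {c: centroid(ps) for c, ps in labeled.items()}
        # Highest-confidence candidate: unlabeled point nearest to any centroid.
        point, cls, _ = min(
            ((p, c, dist2(p, cents[c])) for p in pool for c in cents),
            key=lambda t: t[2],
        )
        labeled[cls].append(point)
        pool.remove(point)
    return labeled

labeled = {"a": [(0.0, 0.0)], "b": [(4.0, 4.0)]}
unlabeled = [(0.5, 0.5), (3.5, 3.6), (1.0, 0.0)]
result = self_train(labeled, unlabeled)
```

Note that this sketch assumes the labeled and unlabeled points come from the same distribution; under the sample-selection bias discussed above, the initial centroids would be skewed and confident-but-wrong pseudo-labels could compound across iterations.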