There has been increasing interest in learning techniques that combine unlabeled data with labeled data, i.e., semi-supervised learning. However, to the best of our knowledge, no study has compared such techniques across different types and amounts of labeled and unlabeled data. Moreover, most published work on semi-supervised learning assumes that the labeled and unlabeled data come from the same distribution. In practice, the labeling process may introduce a selection bias, so that the distributions of data points in the labeled and unlabeled sets differ. Failing to correct for such bias can yield a biased function approximation with potentially poor performance. In this paper, we present an empirical study of several semi-supervised learning techniques on a variety of datasets. We investigate questions such as the effect of independence or relevance among features, the effect of the sizes of the labeled and unlabeled sets, and the effect of noise. We also examine the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique specifically designed to correct for such bias.
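The semi-supervised setting described above can be illustrated with a minimal self-training sketch. This is a hypothetical example, not one of the specific techniques evaluated in the paper: a nearest-centroid classifier repeatedly assigns a label to the unlabeled point it is most confident about (the one closest to any class centroid) and retrains on the enlarged labeled set.

```python
# Minimal self-training sketch (illustrative only): a nearest-centroid
# classifier iteratively absorbs its most confidently labeled unlabeled point.

def centroid(points):
    """Mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(labeled, unlabeled):
    """labeled: dict class -> list of points; unlabeled: list of points.
    Returns the labeled dict after all unlabeled points have been absorbed."""
    labeled = {c: list(ps) for c, ps in labeled.items()}
    pool = list(unlabeled)
    while pool:
        cents = {c: centroid(ps) for c, ps in labeled.items()}
        # Highest-confidence candidate: unlabeled point nearest to any centroid.
        point, cls, _ = min(
            ((p, c, dist2(p, cents[c])) for p in pool for c in cents),
            key=lambda t: t[2],
        )
        labeled[cls].append(point)
        pool.remove(point)
    return labeled

labeled = {"a": [(0.0, 0.0)], "b": [(4.0, 4.0)]}
unlabeled = [(0.5, 0.5), (3.5, 3.6), (1.0, 0.0)]
result = self_train(labeled, unlabeled)
```

Note that this sketch assumes the labeled and unlabeled points come from the same distribution; under the sample-selection bias discussed above, the initial centroids would be skewed and confident-but-wrong pseudo-labels could compound across iterations.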