When Does Cotraining Work in Real Data?

  • Authors:
  • Jun Du; Charles X. Ling; Zhi-Hua Zhou

  • Affiliations:
  • The University of Western Ontario, Ontario; The University of Western Ontario, Ontario; Nanjing University, Nanjing

  • Venue:
  • IEEE Transactions on Knowledge and Data Engineering
  • Year:
  • 2011

Abstract

Cotraining, a paradigm of semisupervised learning, promises to effectively alleviate the shortage of labeled examples in supervised learning. The standard two-view cotraining requires the data set to be described by two views of features, and previous studies have shown that cotraining works well if the two views satisfy the sufficiency and independence assumptions. In practice, however, these two assumptions are often not known or ensured (even when the two views are given). More commonly, most supervised data sets are described by a single set of attributes (one view), so they must be split into two views before the standard two-view cotraining can be applied. In this paper, we first propose a novel approach to empirically verify the two assumptions of cotraining when two views are given. We then design several methods to split single-view data sets into two views so that cotraining works reliably well. Our empirical results show that, given a whole or a large labeled training set, our view verification and splitting methods are quite effective. Unfortunately, cotraining is called for precisely when the labeled training set is small. With small labeled training sets, we show that the two cotraining assumptions are difficult to verify and view splitting is unreliable. Our conclusions on cotraining's effectiveness are therefore mixed. If two views are given and are known to satisfy the two assumptions, cotraining works well. Otherwise, with only a small labeled training set, verifying the assumptions or splitting a single view into two views is unreliable; thus, it is uncertain whether standard cotraining will work.
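
To make the standard two-view procedure concrete, the sketch below runs a few cotraining rounds on a single-view data set whose attributes are split randomly into two views, in the spirit of the view-splitting setting the abstract describes. It is a minimal illustration under assumed choices, not the authors' algorithm: the data set (scikit-learn's breast-cancer data), the Naive Bayes base learners, the labeled-pool size, and the number of rounds are all illustrative.

```python
# Minimal two-view cotraining sketch with a random view split.
# Assumptions (not from the paper): breast-cancer data, GaussianNB learners,
# 30 initially labeled examples, 20 cotraining rounds, 1 example added
# per classifier per round.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Randomly split the single attribute set into two "views".
perm = rng.permutation(X.shape[1])
view1, view2 = perm[: X.shape[1] // 2], perm[X.shape[1] // 2:]

# Small labeled pool L; the remaining examples are treated as unlabeled U.
L = list(rng.choice(len(y), size=30, replace=False))
U = [i for i in range(len(y)) if i not in set(L)]
y_train = y.copy()            # pseudo-labels for U are written here

clf1, clf2 = GaussianNB(), GaussianNB()
for _ in range(20):           # cotraining rounds
    clf1.fit(X[L][:, view1], y_train[L])
    clf2.fit(X[L][:, view2], y_train[L])
    # Each classifier pseudo-labels the unlabeled example it is most
    # confident about on its own view; that example then augments the
    # shared labeled pool used by both classifiers.
    for clf, view in ((clf1, view1), (clf2, view2)):
        if not U:
            break
        probs = clf.predict_proba(X[U][:, view])
        best = int(np.argmax(probs.max(axis=1)))
        idx = U.pop(best)
        y_train[idx] = clf.predict(X[idx:idx + 1][:, view])[0]
        L.append(idx)

print("final labeled-pool size:", len(L))
```

The random split used here is the simplest option; whether such a split actually yields views that satisfy the sufficiency and independence assumptions is exactly the question the paper's verification and splitting methods address.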