Learning classifiers when the training data is not IID

  • Authors:
  • Murat Dundar, Balaji Krishnapuram, Jinbo Bi, R. Bharat Rao

  • Affiliations:
  • Computer Aided Diagnosis & Therapy Group, Siemens Medical Solutions, Malvern, PA (all authors)

  • Venue:
  • IJCAI'07 Proceedings of the 20th International Joint Conference on Artificial Intelligence
  • Year:
  • 2007


Abstract

Most methods for classifier design assume that the training samples are drawn independently and identically from an unknown data-generating distribution, although this assumption is violated in several real-life problems. Relaxing this i.i.d. assumption, we consider algorithms from the statistics literature for the more realistic situation where batches or sub-groups of training samples may have internal correlations, although the samples from different batches may be considered to be uncorrelated. Next, we propose simpler (more efficient) variants that scale well to large datasets; theoretical results from the literature are provided to support their validity. Experimental results from real-life computer-aided diagnosis (CAD) problems indicate that relaxing the i.i.d. assumption leads to statistically significant improvements in the accuracy of the learned classifier. Surprisingly, the simpler algorithm proposed here is experimentally found to be even more accurate than the original version.
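To make the batch-correlation setting concrete, below is a minimal illustrative sketch (not the authors' algorithm) of one simple way to account for correlated sub-groups: each training sample is down-weighted by the size of its batch, so that every batch contributes equal total weight to the loss and a large, internally correlated batch cannot dominate training simply by containing more samples. The weighting scheme, the toy data, and all function names here are assumptions for illustration only.

```python
# Hypothetical sketch: batch-weighted logistic regression, where correlated
# batches of samples each receive equal total weight in the training loss.
# This is NOT the method from the paper; it only illustrates the non-IID setting.
import numpy as np

def batch_weights(batch_ids):
    """Weight each sample by 1 / (its batch size), so each batch has total weight 1."""
    ids, counts = np.unique(batch_ids, return_counts=True)
    size = dict(zip(ids, counts))
    return np.array([1.0 / size[b] for b in batch_ids])

def fit_weighted_logreg(X, y, w, lr=0.1, n_iter=500):
    """Weighted logistic regression fit by plain gradient descent; y in {0, 1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))      # predicted probabilities
        grad = Xb.T @ (w * (p - y)) / w.sum()      # weighted log-loss gradient
        theta -= lr * grad
    return theta

def predict(X, theta):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ theta)) > 0.5).astype(int)

# Toy data: three batches whose members are tightly clustered (i.e. correlated),
# loosely mimicking multiple candidate lesions drawn from the same patient in CAD.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.1, (20, 2)),    # large correlated positive batch
               rng.normal(2.5, 0.1, (3, 2)),     # small positive batch
               rng.normal(-2.0, 0.1, (5, 2))])   # negative batch
y = np.array([1] * 23 + [0] * 5)
batch_ids = np.array([0] * 20 + [1] * 3 + [2] * 5)

theta = fit_weighted_logreg(X, y, batch_weights(batch_ids))
print(predict(X, theta))
```

Under the i.i.d. assumption every sample would get weight 1, and the 20-sample batch would contribute roughly four times the loss of the 5-sample batch; the per-batch weighting above removes that imbalance, which is one intuition for why batch-aware training can help on grouped data such as CAD candidates from the same patient.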