A Mixture Model and EM-Based Algorithm for Class Discovery, Robust Classification, and Outlier Rejection in Mixed Labeled/Unlabeled Data Sets

Authors:
David J. Miller; John Browning
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2003

Abstract

Several authors have shown that, when labeled data are scarce, improved classifiers can be built by augmenting the training set with a large set of unlabeled examples and then performing suitable learning. These works assume each unlabeled sample originates from one of the (known) classes. Here, we assume each unlabeled sample comes from either a known or from a heretofore undiscovered class. We propose a novel mixture model which treats as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each sample. Two types of mixture components are posited. "Predefined" components generate data from known classes and assume class labels are missing at random. "Nonpredefined" components only generate unlabeled data, i.e., they capture exclusively unlabeled subsets, consistent with an outlier distribution or new classes. The predefined/nonpredefined natures are data-driven, learned along with the other parameters via an extension of the EM algorithm. Our modeling framework addresses problems involving both the known and unknown classes: 1) robust classifier design, 2) classification with rejections, and 3) identification of the unlabeled samples (and their components) from unknown classes. Case 3 is a step toward new class discovery. Experiments are reported for each application, including topic discovery for the Reuters domain. Experiments also demonstrate the value of label presence/absence data in learning accurate mixtures.
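
The idea of treating label presence/absence as observed data can be illustrated with a small Python sketch. It is not the authors' algorithm. In place of the paper's discrete predefined/nonpredefined component natures, it assumes a soft per-component probability `rho` that a sample from the component carries a label, so a component whose `rho` is driven toward zero behaves like a "nonpredefined" one. The spherical Gaussian components, the function name `fit_label_aware_mixture`, the `init_means` argument, and the convention that `-1` marks an unlabeled sample are all assumptions made for this example.

```python
import numpy as np

def fit_label_aware_mixture(X, y, n_classes, init_means, n_iter=100, eps=1e-9):
    """EM for a spherical Gaussian mixture in which each component also
    models whether a label is present (hypothetical simplification).
    y[i] = -1 marks an unlabeled sample; otherwise y[i] is a class index."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    mu = np.array(init_means, dtype=float)            # component means, shape (K, d)
    K = mu.shape[0]
    alpha = np.full(K, 1.0 / K)                       # mixing weights
    var = np.full(K, X.var(axis=0).mean())            # one variance per component
    rho = np.full(K, 0.5)                             # P(label present | component)
    beta = np.full((K, n_classes), 1.0 / n_classes)   # P(class | component, labeled)
    labeled = y >= 0

    for _ in range(n_iter):
        # E-step: log of alpha_k * N(x | mu_k, var_k * I) for every sample/component
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = (np.log(alpha + eps)
                 - 0.5 * d * np.log(2.0 * np.pi * var)
                 - 0.5 * sq / var)
        # Labeled samples: times rho_k * beta[k, class]; unlabeled: times (1 - rho_k)
        log_r[labeled] += np.log(rho + eps) + np.log(beta[:, y[labeled]].T + eps)
        log_r[~labeled] += np.log(1.0 - rho + eps)
        log_r -= log_r.max(axis=1, keepdims=True)     # stabilize before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)             # responsibilities, shape (n, K)

        # M-step: closed-form updates from the responsibilities
        nk = r.sum(axis=0) + eps
        alpha = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * sq).sum(axis=0) / (d * nk) + 1e-6
        rho = r[labeled].sum(axis=0) / nk             # labeled share of each component
        for c in range(n_classes):
            beta[:, c] = r[labeled & (y == c)].sum(axis=0)
        beta = (beta + eps) / (beta + eps).sum(axis=1, keepdims=True)

    return {"alpha": alpha, "mu": mu, "var": var, "rho": rho, "beta": beta}
```

In this simplified setting, a component whose fitted `rho` is close to zero has captured only unlabeled samples and is a candidate for an outlier group or an undiscovered class. Components with a clearly positive `rho` can be used for classification through their `beta` rows.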