Semi-supervised clustering: probabilistic models, algorithms and experiments

Authors:
Sugato Basu;Raymond J. Mooney
Affiliations:
The University of Texas at Austin;The University of Texas at Austin
Venue:
Semi-supervised clustering: probabilistic models, algorithms and experiments
Year:
2005

Citing 0
Cited 17

BoostCluster: boosting clustering by pairwise constraints
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Constraint Score: A new filter method for feature selection with pairwise constraints
Pattern Recognition

Connectivity-based parcellation of the cortical mantle using q-ball diffusion imaging
Journal of Biomedical Imaging - Recent Advances in Neuroimaging Methodology

A Semi-supervised Clustering Algorithm Based on Must-Link Set
ADMA '08 Proceedings of the 4th international conference on Advanced Data Mining and Applications

Improving supervised learning performance by using fuzzy clustering method to select training data
Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology - Fuzzy theory and technology with applications

Similarity Relation in Classification Problems
RSCTC '08 Proceedings of the 6th International Conference on Rough Sets and Current Trends in Computing

Fuzzy c-means clustering with prior biological knowledge
Journal of Biomedical Informatics

Partially supervised coreference resolution for opinion summarization through structured rule learning
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Learning assignment order of instances for the constrained K-means clustering algorithm
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics

Metric learning for semi-supervised clustering using pairwise constraints and the geometrical structure of data
Intelligent Data Analysis

A new discriminant principal component analysis method with partial supervision
Neural Processing Letters

Non-linear metric learning using pairwise similarity and dissimilarity constraints and the geometrical structure of data
Pattern Recognition

Multi-view clustering with constraint propagation for learning with an incomplete mapping between views
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Learning low-rank kernel matrices for constrained clustering
Neurocomputing

Have I seen you before? Principles of Bayesian predictive classification revisited
Statistics and Computing

Scalable text and link analysis with mixed-topic link models
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Pattern classification and clustering: A review of partially supervised learning approaches
Pattern Recognition Letters

Abstract

Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance through extensive experiments on data sets from different domains, e.g., text analysis, handwritten character recognition, and bioinformatics.

In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on some of the instances (specifying whether two instances should be in the same or different clusters). We first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose two variants of the well-known KMeans algorithm that can improve clustering performance with limited labeled data. We then focus on the problem of semi-supervised clustering with constraints and show how this problem can be studied in the framework of a well-defined probabilistic generative model, the Hidden Markov Random Field (HMRF). We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees for our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance).

Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for semi-supervised clustering with constraints. Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge to improve overlapping clustering of data, (3) integration of constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.
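
To make the idea of pairwise constraints concrete, the following is a minimal Python sketch of one generic way to score a clustering under must-link and cannot-link constraints: squared Euclidean distortion plus a penalty for each violated constraint. It is not the HMRF-KMeans objective from the thesis, whose exact form the abstract does not give. The function names, the single uniform `penalty` weight, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]  # indices of two instances


def squared_euclidean(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_objective(
    points: List[Point],
    labels: List[int],
    centroids: List[Point],
    must_link: List[Pair],
    cannot_link: List[Pair],
    penalty: float = 1.0,  # hypothetical uniform cost per violated constraint
) -> float:
    """Distortion of the assignment plus a penalty per violated constraint."""
    # Standard KMeans term: distance of each point to its assigned centroid.
    distortion = sum(
        squared_euclidean(p, centroids[k]) for p, k in zip(points, labels)
    )
    # A must-link pair is violated when its two points get different labels.
    ml_violations = sum(1 for i, j in must_link if labels[i] != labels[j])
    # A cannot-link pair is violated when its two points share a label.
    cl_violations = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return distortion + penalty * (ml_violations + cl_violations)


# Toy data: two well-separated groups of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids = [(0, 1), (10, 1)]
must_link = [(0, 1)]    # points 0 and 1 should share a cluster
cannot_link = [(1, 2)]  # points 1 and 2 should be separated

good = constrained_objective(points, [0, 0, 1, 1], centroids, must_link, cannot_link)
bad = constrained_objective(points, [0, 1, 1, 1], centroids, must_link, cannot_link)
print(good, bad)
```

In this toy case the assignment that respects both constraints scores lower than the one that breaks them, which is the behavior a constraint-aware clustering algorithm would try to exploit when reassigning points.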