Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance through extensive experiments on data sets from different domains, e.g., text analysis, handwritten character recognition, and bioinformatics. In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on some of the instances (specifying whether two instances should be in the same or different clusters). In this thesis, we first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose two variants of the well-known KMeans algorithm that improve its performance with limited labeled data. We then focus on the problem of semi-supervised clustering with constraints and show how this problem can be studied in the framework of a well-defined probabilistic generative model, the Hidden Markov Random Field (HMRF). We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees for our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance).
Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for semi-supervised clustering with constraints. Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge to improve overlapping clustering of data, (3) integration of constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.
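
To make the idea of clustering with pairwise constraints concrete, the following is a minimal sketch, not the HMRF-KMeans algorithm of the thesis. It assumes squared Euclidean distortion and a single uniform penalty weight `w` per violated constraint; the function names `constrained_objective` and `assign_greedy`, the parameter `w`, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def constrained_objective(X, labels, centroids, must_link, cannot_link, w=1.0):
    """Squared-Euclidean distortion plus a penalty w per violated constraint.

    Assumption: uniform penalty weight w; the thesis defines its own objective.
    """
    distortion = float(np.sum((X - centroids[labels]) ** 2))
    ml_violations = sum(1 for i, j in must_link if labels[i] != labels[j])
    cl_violations = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return distortion + w * (ml_violations + cl_violations)

def assign_greedy(X, labels, centroids, must_link, cannot_link, w=1.0):
    """One greedy pass: move each point to the cluster with the lowest local cost."""
    labels = labels.copy()
    cluster_ids = np.arange(centroids.shape[0])
    for i in range(X.shape[0]):
        # Distance from point i to every centroid.
        cost = np.sum((centroids - X[i]) ** 2, axis=1).astype(float)
        for a, b in must_link:
            if i in (a, b):
                other = b if i == a else a
                # Penalize every cluster except the partner's current cluster.
                cost += w * (cluster_ids != labels[other])
        for a, b in cannot_link:
            if i in (a, b):
                other = b if i == a else a
                # Penalize the partner's current cluster.
                cost += w * (cluster_ids == labels[other])
        labels[i] = int(np.argmin(cost))
    return labels

# Toy data: two tight groups, with a cannot-link between points 0 and 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centroids = np.array([[0.0, 0.5], [5.0, 5.5]])
labels = np.array([0, 0, 1, 1])
cannot_link = [(0, 1)]

before = constrained_objective(X, labels, centroids, [], cannot_link, w=100.0)
new_labels = assign_greedy(X, labels, centroids, [], cannot_link, w=100.0)
after = constrained_objective(X, new_labels, centroids, [], cannot_link, w=100.0)
```

With a large `w`, the greedy pass separates the two cannot-linked points even though they are close together, trading higher distortion for a satisfied constraint. A full algorithm would alternate this assignment step with centroid re-estimation.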