Given a set of elements N, a partition divides the elements into two or more disjoint clusters that together cover all elements. Each cluster is a non-empty subset of N, so the number of clusters in a partition is at most |N|. Different partitioning algorithms for the same application can produce different partitions of the same set of elements. Computing the distance between two or more partitions, and finding their consensus partition (also called consensus clustering), are important problems that arise in many applications, such as bioinformatics and data mining. However, different distance functions between partitions usually require different algorithms. In this paper, we discuss the k partition-distance problem, which can be applied in bioinformatics. Given a set of elements N with k partitions, the k partition-distance problem is to delete the minimum number of elements from each partition such that all remaining partitions become identical. This problem has been shown to be NP-complete when k > 2. We present the first known approximation algorithm for this problem, with performance ratio 2 and running time $O(k \rho |N|)$, where ρ is the maximum number of clusters among the k partitions. We then apply our algorithm to simulated random data and to an actual set of organisms based on DNA markers. The results show that, in practice, our approximate solution is at most twice the partition-distance of the optimal solution.
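To make the problem definition concrete, the following is a minimal brute-force sketch in Python. It is not the approximation algorithm presented in this paper; it simply enumerates candidate deletion sets in order of increasing size and returns the first one after which all k partitions coincide. The function names (`restrict`, `k_partition_distance_bruteforce`) and the example data are hypothetical, and the running time is exponential in |N|, so it is only suitable for very small inputs.

```python
from itertools import combinations


def restrict(partition, keep):
    """Restrict a partition to the elements in `keep`, dropping emptied clusters."""
    clusters = (frozenset(c) & keep for c in partition)
    return frozenset(c for c in clusters if c)


def k_partition_distance_bruteforce(elements, partitions):
    """Return (distance, deleted) for the k partition-distance problem.

    `distance` is the minimum number of elements whose deletion from every
    partition makes all partitions identical; `deleted` is one such set.
    Exhaustive search: illustrative only, exponential in len(elements).
    """
    elements = frozenset(elements)
    ordered = sorted(elements)
    # Try the smallest deletion sets first, so the first hit is optimal.
    for size in range(len(ordered) + 1):
        for removed in combinations(ordered, size):
            keep = elements - frozenset(removed)
            restricted = {restrict(p, keep) for p in partitions}
            if len(restricted) == 1:  # all k partitions now coincide
                return size, set(removed)
    # Not reached for non-empty input: deleting every element always works.
    return len(ordered), set(ordered)


# Hypothetical example: three partitions of N = {1, 2, 3, 4, 5}.
N = {1, 2, 3, 4, 5}
P1 = [{1, 2}, {3, 4}, {5}]
P2 = [{1, 2, 3}, {4, 5}]
P3 = [{1, 2}, {3}, {4, 5}]
distance, deleted = k_partition_distance_bruteforce(N, [P1, P2, P3])
print(distance, deleted)
```

In this example, deleting elements 3 and 4 leaves every partition as {1, 2}, {5}, and no single deletion suffices, so the partition-distance is 2.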