Clustering methods can be either data-driven or need-driven. Data-driven methods aim to discover the true structure of the underlying data, while need-driven methods aim to organize that structure to meet application requirements. Need-driven (e.g., constrained) clustering can therefore find more useful and actionable clusters in applications such as energy-aware sensor networks, privacy preservation, and market segmentation. However, existing constrained clustering methods require users to provide the number of clusters, which is often unknown in advance yet has a crucial impact on the clustering result. In this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints determine the number of clusters. For this purpose, we introduce a novel cluster model, Constraint-Driven Clustering (CDC), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. We consider two general types of constraints, minimum significance constraints and minimum variance constraints, as well as combinations of the two. We prove that the CDC problem is NP-hard under different constraints. We propose a novel dynamic data structure, the CD-Tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the CDC constraints and minimizes the objective function. Based on CD-Trees, we develop an efficient algorithm to solve the new clustering problem. Our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm.
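The abstract does not define the two constraint types or the objective function formally, so the following is only a minimal sketch under stated assumptions: a minimum significance constraint is read as a lower bound on the number of points per cluster, a minimum variance constraint as a lower bound on the mean squared distance of a cluster's points to its centroid, and compactness as the total sum of squared distances to the centroids. The function names and the parameters `min_size` and `min_variance` are hypothetical, and the sketch only checks a given clustering; it does not implement the CD-Tree or the CDC algorithm.

```python
from statistics import fmean


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(fmean(coord) for coord in zip(*points))


def sum_squared_error(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def satisfies_constraints(clusters, min_size, min_variance):
    """Check the assumed constraints on every cluster.

    Assumption: significance = at least `min_size` points per cluster;
    variance = mean squared distance to the centroid >= `min_variance`.
    """
    for points in clusters:
        if len(points) < min_size:
            return False
        if sum_squared_error(points) / len(points) < min_variance:
            return False
    return True


def objective(clusters):
    """Assumed compactness objective: total within-cluster squared error."""
    return sum(sum_squared_error(points) for points in clusters)


# Two hand-made clusters in the plane (illustrative data only).
clusters = [
    [(0, 0), (2, 0)],
    [(10, 10), (10, 12), (10, 14)],
]
feasible = satisfies_constraints(clusters, min_size=2, min_variance=1.0)
cost = objective(clusters)
```

Note that the number of clusters is never passed in: under this reading, any partition that meets the constraints is feasible, and among feasible partitions the one with the lowest objective value is preferred, which is how the constraints rather than the user end up deciding the cluster count.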