Constraint-driven clustering

Authors:
Rong Ge;Martin Ester;Wen Jin;Ian Davidson
Affiliations:
Simon Fraser University;Simon Fraser University;Simon Fraser University;State University of New York: Albany
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007



Abstract

Clustering methods can be either data-driven or need-driven. Data-driven methods aim to discover the true structure of the underlying data, while need-driven methods aim to organize that structure to meet certain application requirements. Thus, need-driven (e.g., constrained) clustering can find more useful and actionable clusters in applications such as energy-aware sensor networks, privacy preservation, and market segmentation. However, existing constrained clustering methods require users to provide the number of clusters, which is often unknown in advance yet has a crucial impact on the clustering result. In this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. For this purpose, we introduce a novel cluster model, Constraint-Driven Clustering (CDC), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. Two general types of constraints are considered, namely minimum significance constraints and minimum variance constraints, as well as combinations of the two. We prove the NP-hardness of the CDC problem under different constraints. We propose a novel dynamic data structure, the CD-Tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the CDC constraints and minimizes the objective function. Based on CD-Trees, we develop an efficient algorithm to solve the new clustering problem. Our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm.
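The abstract names the two constraint types without defining them. As a rough illustration only, the sketch below assumes that a minimum significance constraint requires each cluster to hold at least `min_size` points, that a minimum variance constraint requires each attribute within a cluster to have population variance of at least `min_variance`, and that compactness is measured by the sum of squared distances to cluster centroids. These readings, and all function and parameter names, are assumptions made for this example and are not taken from the paper; the CD-Tree itself is not modeled.

```python
from statistics import fmean, pvariance


def centroid(cluster):
    """Per-attribute mean of a cluster given as a list of equal-length tuples."""
    return tuple(fmean(coords) for coords in zip(*cluster))


def sse(clusters):
    """Assumed compactness objective: sum of squared distances to centroids."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)
    return total


def satisfies_constraints(clusters, min_size, min_variance):
    """Check the assumed constraints on every cluster.

    Assumed significance constraint: at least `min_size` points per cluster.
    Assumed variance constraint: every attribute has population variance
    of at least `min_variance` within the cluster.
    """
    for cluster in clusters:
        if len(cluster) < min_size:
            return False
        if any(pvariance(coords) < min_variance for coords in zip(*cluster)):
            return False
    return True


def max_clusters(n_points, min_size):
    """A size constraint alone caps the number of clusters at n // min_size."""
    return n_points // min_size


points = [
    (0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),
    (10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0),
]
two_clusters = [points[:4], points[4:]]
four_clusters = [points[i:i + 2] for i in range(0, 8, 2)]

print(satisfies_constraints(two_clusters, min_size=4, min_variance=1.0))
print(satisfies_constraints(four_clusters, min_size=4, min_variance=1.0))
print(max_clusters(len(points), min_size=4))
```

Under these assumptions, the finer four-cluster partition is more compact but violates the size constraint, so the constraints rather than a user-supplied parameter limit how many clusters are admissible.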