A database clustering methodology and tool

Authors:
Tae-Wan Ryu;Christoph F. Eick
Affiliations:
Department of Computer Science, California State University, Fullerton, CA;Department of Computer Science, University of Houston, Houston, TX
Venue:
Information Sciences—Informatics and Computer Science: An International Journal
Year:
2005

Abstract

Clustering is a popular data analysis and data mining technique. However, applying traditional clustering algorithms directly to a database is not straightforward, because a database usually consists of structured and related data. Moreover, there may be several object views of the database to be clustered, depending on the data analyst's particular interest. Finally, in many cases there is a data model discrepancy between the format used to store the database to be analyzed and the representation format that clustering algorithms expect as their input. These discrepancies have been mostly ignored by current research.

This paper focuses on identifying those discrepancies and on analyzing their impact on the application of clustering techniques to databases. We are particularly interested in the question of how clustering algorithms can be generalized to become more directly applicable to real-world databases. The paper introduces methodologies, techniques, and tools that serve this purpose. We propose a data set representation framework for database clustering that characterizes objects to be clustered through sets of tuples, and we introduce preprocessing techniques and tools to generate object views based on this framework. Moreover, we introduce bag-oriented similarity measures and clustering algorithms that are suitable for the proposed framework. We also demonstrate that our approach, through bag-oriented clustering, can deal with the relationship information commonly found in databases. We further argue that our bag-oriented data representation framework is more suitable for database clustering than the commonly used flat file format and produces better-quality clusters.
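
To illustrate what it means to characterize an object through a set of tuples rather than a single flat row, here is a minimal Python sketch. It is not the paper's framework or its similarity measures, which the abstract does not specify. The example tables, the `build_object_view` helper, and the multiset Jaccard similarity in `bag_similarity` are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical relational data: each customer has zero or more purchases.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
purchases = [
    (1, "book"), (1, "book"), (1, "pen"),
    (2, "book"), (2, "pen"),
    (3, "laptop"),
]

def build_object_view(customers, purchases):
    """Group the related purchase values into one bag (multiset) per customer.

    A flat-file join would instead emit one row per purchase, repeating the
    customer and hiding the fact that those rows describe the same object.
    """
    bags = defaultdict(Counter)
    for cust_id, item in purchases:
        bags[cust_id][item] += 1
    # Customers without purchases get an empty bag.
    return {cust_id: bags[cust_id] for cust_id, _name in customers}

def bag_similarity(a, b):
    """Multiset Jaccard (an assumed measure): |a ∩ b| / |a ∪ b| with counts."""
    union = sum((a | b).values())
    if union == 0:
        return 1.0  # two empty bags are treated as identical
    return sum((a & b).values()) / union

view = build_object_view(customers, purchases)
print(bag_similarity(view[1], view[2]))
print(bag_similarity(view[1], view[3]))
```

Any clustering algorithm that only needs pairwise similarities could, under these assumptions, operate on `view` directly without first flattening the one-to-many relationship.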