Context-Based Similarity Measures for Categorical Databases

Authors:
Gautam Das; Heikki Mannila
Venue:
PKDD '00 Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2000

Abstract

Similarity between complex data objects is one of the central notions in data mining. We propose similarity (or distance) measures between the components of a 0/1 relation: between attributes, between rows, and between subrelations of the database. These measures have important applications in clustering, classification, and several other data mining processes. They are based on the contexts of the individual components. For example, two products (i.e., attributes) are deemed similar if their respective sets of customers (i.e., subrelations) are similar. This reveals subtler relationships between components than simpler measures usually capture. Because the distance between attributes is defined through the distance between subrelations, which in turn depends on the attributes, the problem of finding the distances can be formulated as a system of nonlinear equations. We present an iterative algorithm which, when seeded with random initial values, converges quickly to stable distances in practice (typically requiring fewer than five iterations). The algorithm requires only one database scan. Results on artificial and real data show that our method is efficient and produces results with intuitive appeal.
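
The abstract does not give the equations, so the following is only a minimal sketch of the general idea of a context-based, iteratively refined attribute distance, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the function names `context_profiles` and `iterate_distances`, the choice to summarize each attribute's subrelation by its mean row, and the update rule (a quadratic form weighted by `1 - D`, clipped at zero and rescaled to [0, 1]). The one feature it tries to mirror is that the data is read once to build co-occurrence counts, after which the iteration works on those summaries alone. Convergence of this particular update is not guaranteed, so the loop is capped.

```python
import numpy as np


def context_profiles(R):
    """Single pass over the 0/1 relation: build per-attribute context profiles.

    Row a of the result is the mean row of the subrelation made of all rows
    in which attribute a equals 1 (an assumed summary, not from the paper).
    """
    R = np.asarray(R, dtype=float)
    C = R.T @ R                      # C[a, b] = rows where a and b are both 1
    support = np.diag(C).copy()      # rows where attribute a is 1
    support[support == 0] = 1.0      # guard against attributes that never occur
    return C / support[:, None]


def iterate_distances(R, max_iter=50, tol=1e-6, seed=0):
    """Fixed-point iteration on an attribute-distance matrix D (illustrative).

    Two attributes are close when their context profiles are close, and
    closeness of profiles is itself measured using the current D.
    """
    P = context_profiles(R)
    m = P.shape[0]

    # Random symmetric starting distances with a zero diagonal.
    rng = np.random.default_rng(seed)
    D = rng.random((m, m))
    D = (D + D.T) / 2.0
    np.fill_diagonal(D, 0.0)

    # diff[a, b] is the difference between the profiles of attributes a and b.
    diff = P[:, None, :] - P[None, :, :]

    for it in range(1, max_iter + 1):
        K = 1.0 - D                  # assumed similarity weights from distances
        q = np.einsum("abi,ij,abj->ab", diff, K, diff)
        D_new = np.sqrt(np.clip(q, 0.0, None))   # clip: K need not be PSD
        peak = D_new.max()
        if peak > 0:
            D_new = D_new / peak     # keep distances in [0, 1]
        if np.abs(D_new - D).max() < tol:
            return D_new, it
        D = D_new
    return D, max_iter


# Toy 0/1 relation: rows are customers, columns are products.
# Products 0 and 1 are bought by exactly the same customers.
R = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
D, iterations = iterate_distances(R)
print(np.round(D, 3), iterations)
```

In this sketch, products 0 and 1 have identical customer sets, so their profiles coincide and their distance is zero at every iteration regardless of the random seed. The remaining entries depend on the assumed update rule and should not be read as reproducing the paper's measures.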