DISC: data-intensive similarity measure for categorical data

Authors:
Aditya Desai; Himanshu Singh; Vikram Pudi
Affiliations:
International Institute of Information Technology-Hyderabad, Hyderabad, India; International Institute of Information Technology-Hyderabad, Hyderabad, India; International Institute of Information Technology-Hyderabad, Hyderabad, India
Venue:
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II
Year:
2011

Abstract

The concept of similarity is fundamental to almost every scientific field. Clustering, distance-based outlier detection, classification, regression, and search are major data mining techniques that compute similarities between instances, so the choice of similarity measure can largely determine whether an algorithm succeeds or fails. Defining similarity or distance for categorical data is harder than for continuous data and remains a major challenge: the values taken by a categorical attribute are not inherently ordered, so two categorical values cannot be compared directly. In addition, the notion of similarity can differ depending on the domain, dataset, or task at hand. In this paper we present DISC (Data-Intensive Similarity Measure for Categorical Data), a new similarity measure for categorical data. DISC captures the semantics of the data without requiring a domain expert to define similarity. It is also generic and simple to implement. These features make it an attractive alternative to existing approaches. Our experimental study compares DISC with 14 other similarity measures on 24 standard real datasets, 12 used for classification and 12 for regression, and shows that it is more accurate than all its competitors.
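To make the difficulty with unordered values concrete, here is a minimal sketch of the simple overlap (matching) measure, a common baseline for categorical data. It is not DISC, whose definition the abstract does not give. The function name and the example records are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def overlap_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of attributes on which two categorical records agree.

    Illustrative baseline only; this is NOT the DISC measure.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if not a:
        raise ValueError("records must have at least one attribute")
    # Categorical values have no order, so each attribute either matches or not.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical records with attributes (color, shape, size).
r1 = ("red", "round", "small")
r2 = ("red", "square", "small")
r3 = ("crimson", "round", "small")

print(overlap_similarity(r1, r2))  # 2 of 3 attributes match
print(overlap_similarity(r1, r3))  # also 2 of 3: "red" vs "crimson" is a full mismatch
```

Under this baseline, "red" is exactly as far from "crimson" as "round" is from "square", because every mismatch counts the same. A measure that draws on the data itself, as the abstract says DISC does, is presumably meant to distinguish such cases without an expert supplying the distances.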