Distance functions for categorical and mixed variables

Authors:
Brendan McCane; Michael Albert
Affiliations:
Department of Computer Science, University of Otago, P.O. Box 56, Dunedin 9015, Otago, New Zealand (both authors)
Venue:
Pattern Recognition Letters
Year:
2008

Abstract

In this paper, we compare three measures for computing Mahalanobis-type distances between random variables consisting of several categorical dimensions, or of mixed categorical and numeric dimensions: regular simplex, tensor product space, and symbolic covariance. The tensor product space and symbolic covariance distances are new contributions. We test the methods in two application domains: classification and principal components analysis. We find that the tensor product space distance is impractical for most problems. Overall, the regular simplex method is the most successful in both domains, but the symbolic covariance method has several advantages, including time and space efficiency, applicability to different contexts, and theoretical neatness.
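
The abstract only names the regular simplex method, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes each categorical value is mapped to a vertex of a regular simplex with unit edge length (here, scaled one-hot vectors), numeric columns are appended unchanged, and a Mahalanobis-type distance is taken with a pseudo-inverse of the sample covariance. The function names `simplex_encode`, `encode_mixed`, and `mahalanobis` are hypothetical.

```python
import numpy as np


def simplex_encode(column):
    """Map each category to a vertex of a regular simplex with unit edges.

    Assumption: scaled one-hot vectors serve as the vertices. For k levels
    they span a (k - 1)-dimensional affine subspace and are pairwise at
    Euclidean distance 1.
    """
    levels = sorted(set(column))
    index = {level: i for i, level in enumerate(levels)}
    codes = np.zeros((len(column), len(levels)))
    for row, value in enumerate(column):
        codes[row, index[value]] = 1.0 / np.sqrt(2.0)
    return codes


def encode_mixed(categorical_columns, numeric_columns):
    """Stack simplex-encoded categorical columns next to numeric columns."""
    blocks = [simplex_encode(c) for c in categorical_columns]
    blocks += [np.asarray(c, dtype=float).reshape(-1, 1) for c in numeric_columns]
    return np.hstack(blocks)


def mahalanobis(x, y, data):
    """Mahalanobis-type distance using the sample covariance of `data`.

    The pseudo-inverse is an assumption of this sketch: the one-hot style
    encoding makes the covariance matrix singular, so a plain inverse
    would fail.
    """
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    squared = float(diff @ np.linalg.pinv(cov) @ diff)
    return float(np.sqrt(max(squared, 0.0)))  # guard against rounding below 0


if __name__ == "__main__":
    colour = ["a", "b", "c", "a", "b"]   # made-up categorical column
    size = [1.0, 2.0, 0.5, 3.0, 2.5]     # made-up numeric column
    data = encode_mixed([colour], [size])
    print(mahalanobis(data[0], data[1], data))
```

The scaled one-hot encoding uses k coordinates for k levels instead of the minimal k - 1; it is chosen here only because it makes the equal-distance property easy to check.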