Distance functions for categorical and mixed variables

  • Authors:
  • Brendan McCane;Michael Albert

  • Affiliations:
  • Department of Computer Science, University of Otago, P.O. Box 56, Dunedin 9015, Otago, New Zealand;Department of Computer Science, University of Otago, P.O. Box 56, Dunedin 9015, Otago, New Zealand

  • Venue:
  • Pattern Recognition Letters
  • Year:
  • 2008

Abstract

In this paper, we compare three measures for computing Mahalanobis-type distances between random variables that consist of several categorical dimensions, or of mixed categorical and numeric dimensions: regular simplex, tensor product space, and symbolic covariance. The tensor product space and symbolic covariance distances are new contributions. We test the methods in two application domains: classification and principal components analysis. We find that the tensor product space distance is impractical for most problems. Overall, the regular simplex method is the most successful in both domains, but the symbolic covariance method has several advantages, including time and space efficiency, applicability to different contexts, and theoretical neatness.
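As a rough illustration of the idea behind the regular simplex measure, the sketch below places each level of a categorical variable at a vertex of a regular simplex, so that any two distinct levels are the same distance apart. This is an assumption-laden sketch and not the authors' implementation: the scaled one-hot encoding, the function names (`simplex_encode`, `encode_row`, `euclidean`), and the `schema` layout are all hypothetical. It also uses plain Euclidean distance on the encoded vectors, whereas a Mahalanobis-type distance would additionally weight by the covariance of the encoded data.

```python
import math

def simplex_encode(value, levels):
    """Map a categorical value to a vertex of a regular simplex.

    Assumption: a one-hot vector scaled by 1/sqrt(2), which puts any
    two distinct levels exactly distance 1 apart.
    """
    scale = 1.0 / math.sqrt(2.0)
    return [scale if value == level else 0.0 for level in levels]

def encode_row(row, schema):
    """Encode a mixed row into one numeric vector.

    schema holds one entry per column: None for a numeric column,
    or the list of levels for a categorical column (hypothetical layout).
    """
    encoded = []
    for value, levels in zip(row, schema):
        if levels is None:
            encoded.append(float(value))  # numeric columns pass through
        else:
            encoded.extend(simplex_encode(value, levels))
    return encoded

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# One categorical column with three levels and one numeric column.
schema = [["red", "green", "blue"], None]
a = encode_row(("red", 1.0), schema)
b = encode_row(("blue", 1.0), schema)
c = encode_row(("red", 4.0), schema)

d_ab = euclidean(a, b)  # rows differ only in the categorical column
d_ac = euclidean(a, c)  # rows differ only in the numeric column
```

Under this encoding, a categorical variable with k levels occupies k coordinates, so the encoded dimension grows with the total number of levels across all categorical columns.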