We investigate a distance metric, previously defined for measuring structured data, in the more general context of vector spaces. The metric, Structural Entropic Difference (SED), has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. Like Cosine Distance, it gives an outcome based on the dimensional correlation, rather than the magnitude, of the input vectors. In this paper the metric is defined and assessed, in comparison with Cosine Distance, with respect to its major properties: its semantics, its suitability for use within similarity search, and its evaluation efficiency. We find that it correlates fairly well with Cosine Distance in dense spaces, but that its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth; this result is corroborated by a further experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite that allows its evaluation to be performed much more efficiently. Overall, when a multivariate correlation metric is required over positive vectors, SED appears to be a better choice than Cosine Distance in many circumstances.
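To make the idea of an entropy-based distance over positive vectors concrete, the following is a minimal sketch of one plausible formulation drawn from the structural-entropic-difference literature: each vector is normalized to a probability distribution, its "information content" is taken as C(v) = 2^H(v) for Shannon entropy H, and the distance compares the content of the merged distribution against the geometric mean of the inputs. The exact definition used in the paper, and the efficient algebraic rewrite for sparse vectors, are given in the paper itself; this sketch is an illustrative assumption, not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a positive vector so its components sum to 1."""
    s = sum(v)
    return [x / s for x in v]

def entropy(p):
    """Shannon entropy (base 2) of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def complexity(v):
    """Information content C(v) = 2^H(v) of the normalized vector."""
    return 2.0 ** entropy(normalize(v))

def sed(v, w):
    """Sketch of Structural Entropic Difference over positive vectors.

    0 when the normalized vectors are identical; 1 when their
    non-zero dimensions are disjoint. Depends only on dimensional
    correlation, not on vector magnitude.
    """
    m = [(a + b) / 2 for a, b in zip(normalize(v), normalize(w))]
    return complexity(m) / math.sqrt(complexity(v) * complexity(w)) - 1
```

Note that `sed([1, 2, 3], [2, 4, 6])` is 0, since the metric is insensitive to magnitude, while fully disjoint vectors such as `[1, 0]` and `[0, 1]` give the maximum value of 1.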