Similarity of objects and the meaning of words

Authors:
Rudi Cilibrasi;Paul Vitanyi
Affiliations:
CWI, Amsterdam, The Netherlands;CWI, Amsterdam, The Netherlands
Venue:
TAMC'06 Proceedings of the Third international conference on Theory and Applications of Models of Computation
Year:
2006

Abstract

We survey the emerging area of compression-based, parameter-free similarity distance measures useful in data mining, pattern recognition, learning, and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract, like “red” or “Christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for the) designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
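
The abstract names the two distances but gives no formulas. As a rough illustration only, the sketch below uses the forms in which these measures are usually stated in the authors' related work: a normalized compression distance for literal objects and a normalized distance over page counts for names. Several things here are assumptions and not taken from this record: the use of zlib as the compressor, the function names `compressed_size`, `ncd`, and `ngd`, and the idea that page counts are supplied by the caller as plain numbers (no search engine is queried).

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of zlib-compressed data (an assumed stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two literal objects.

    Computes (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the
    compressed size. Values near 0 suggest similar objects, values near 1
    suggest unrelated ones.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized distance between two names, computed from page counts.

    fx, fy: number of pages containing each term alone.
    fxy:    number of pages containing both terms.
    n:      total number of indexed pages (a normalizing constant).
    All counts are caller-supplied; returns infinity if any count is zero.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

For instance, `ncd(a, a)` for a repetitive text `a` should come out much smaller than `ncd(a, c)` for an unrelated byte string `c`, and `ngd` returns 0 when two terms always occur together (`fx == fy == fxy`). A real compressor only approximates the ideal notion of shared information, so the results depend on the compressor chosen and on the size of the inputs.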