The similarity metric

Authors:
Ming Li;Xin Chen;Xin Li;Bin Ma;Paul Vitányi
Affiliations:
University of Waterloo, Waterloo, Ontario, Canada, and with BioInformatics Solutions Inc., Waterloo, Canada;University of California, Santa Barbara, CA;University of Western Ontario, London, Ontario, Canada;University of Western Ontario, London, Ontario, Canada;Center of Mathematics and Computer Science (CWI) and the University of Amsterdam. CWI, Kruislaan, Amsterdam, The Netherlands
Venue:
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2003

Abstract

We study a new class of metrics appropriate for measuring effective similarity relations between sequences, with one type of similarity per metric. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it minorizes every metric in the class (that is, it is universal in that it discovers all effective similarities). We demonstrate that it is itself a metric and takes values in [0, 1]; hence it may be called the similarity metric. This provides a theoretical foundation for a new general practical tool. We give two distinctive applications in widely divergent areas (the experiments by necessity use only computable approximations to the target notions). First, we computationally compare whole mitochondrial genomes and infer their evolutionary history. This results in the first completely automatically computed whole mitochondrial phylogeny tree. Second, we give a fully automatically computed language tree of 52 different languages, based on translated versions of the "Universal Declaration of Human Rights".
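
The abstract notes that the experiments rely on computable approximations to Kolmogorov complexity. A minimal sketch of one such approximation follows, under stated assumptions: a general-purpose compressor (zlib here, chosen only for illustration and not the compressor used in the paper) stands in for the complexity K, and the distance uses the compression-based form (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) that is common in later literature. The function names compressed_size and ncd are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Byte length of the zlib-compressed input.

    Assumption: this is only a crude, computable stand-in for the
    Kolmogorov complexity K(x), which is itself noncomputable.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based approximation of a normalized distance.

    Computes (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is
    the compressed size. Values near 0 suggest similar inputs and
    values near 1 suggest unrelated inputs. A real compressor can
    push the value slightly outside [0, 1].
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Illustrative inputs only; not data from the paper.
    text_a = b"All human beings are born free and equal in dignity and rights. " * 20
    text_b = b"Everyone has the right to life, liberty and security of person. " * 20
    print("same text:     ", ncd(text_a, text_a))
    print("different text:", ncd(text_a, text_b))
```

With a real compressor the result is only approximately symmetric and approximately zero for identical inputs, so it should be read as a heuristic estimate rather than as the metric itself.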