Similarity Calculation with Length Delimiting Dictionary Distance

  • Authors: Andre Burkovski; Sebastian Klenk; Gunther Heidemann
  • Venue: ICTAI '11: Proceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence
  • Year: 2011

Abstract

The Normalized Compression Distance (NCD) has gained considerable interest in pattern recognition as a similarity measure applicable to unstructured data of very different domains, such as text, DNA sequences, or images. NCD uses existing compression programs such as gzip to compute similarity between objects. NCD has unique features: It does not require any prior knowledge, data preprocessing, feature extraction, domain adaptation or any parameter settings. Further, the NCD can be applied to symbolic data and raw signals alike. In this paper we decompose the NCD and introduce a method to measure compression-based similarity without the need to use compression. The Length Delimiting Dictionary Distance (LD³) takes the one component essential in compression methods, the dictionary generation, and strips the NCD of all dispensable components. The LD³ performs "compression based pattern recognition without compression", keeping all of the above benefits of the NCD while achieving better speed and recognition rates. We first review the NCD, introduce LD³ as the "essence" of NCD, and evaluate the LD³ based on language tree experiments, authorship recognition, and genome phylogeny data.
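
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the NCD only. It assumes the commonly cited definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length; the abstract itself does not state the formula. The function names `compressed_size` and `ncd` are illustrative, and Python's `zlib` (the DEFLATE algorithm that gzip also uses) stands in for an external gzip program. The sketch does not implement LD³, whose details are not given in the abstract.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance under the assumed standard formula.

    Values near 0 suggest the inputs share much structure; values
    near 1 suggest they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because the measure only needs compressed lengths, it can be applied to any byte sequence without feature extraction, which is the property the abstract highlights. Results depend on the chosen compressor and its settings.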