Clustering the normalized compression distance for influenza virus data

Authors:
Kimihito Ito;Thomas Zeugmann;Yu Zhu
Affiliations:
Research Center for Zoonosis Control, Hokkaido University, Sapporo, Japan;Division of Computer Science, Hokkaido University, Japan;Division of Computer Science, Hokkaido University, Japan
Venue:
Algorithms and Applications
Year:
2010

Abstract

This paper analyzes how useful the normalized compression distance is for clustering hemagglutinin (HA) gene sequences of influenza viruses, and how the outcome depends on the compressor used. Using the CompLearn Toolkit, the built-in compressors zlib and bzip2 are compared. In addition, hierarchical clustering is compared with spectral clustering. For hierarchical clustering, the hclust function from R is used; spectral clustering is performed with the kLines algorithm proposed by Fischer and Poland (2004). The results are very promising and show that an (almost) perfect clustering can be obtained. The zlib compressor yielded better results than the bzip2 compressor and, when all data are considered, hierarchical clustering performed slightly better than spectral clustering via kLines.
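To make the central quantity concrete, the following is a minimal sketch of the normalized compression distance, using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. It uses Python's standard zlib and bz2 modules as stand-ins for the CompLearn Toolkit compressors; it is not the toolkit's implementation, and the function names ncd and distance_matrix are hypothetical names chosen for illustration.

```python
import bz2
import zlib


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance between two byte strings.

    `compress` is any function mapping bytes to compressed bytes;
    zlib.compress and bz2.compress are used here as assumptions,
    standing in for the compressors named in the abstract.
    """
    cx = len(compress(x))       # C(x)
    cy = len(compress(y))       # C(y)
    cxy = len(compress(x + y))  # C(xy), compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs, compress=zlib.compress):
    """Pairwise NCD matrix (list of lists) for a list of byte strings."""
    return [[ncd(a, b, compress) for b in seqs] for a in seqs]


if __name__ == "__main__":
    # Toy sequences for illustration only; these are not real HA data.
    seqs = [b"ACGT" * 200, b"ACGT" * 190 + b"TTGACCAT" * 5, b"GGCATTCA" * 100]
    for name, comp in (("zlib", zlib.compress), ("bzip2", bz2.compress)):
        print(name)
        for row in distance_matrix(seqs, comp):
            print("  ", ["%.3f" % d for d in row])
```

A matrix of this kind is what a hierarchical or spectral clustering routine would take as input. Because real compressors only approximate the ideal, NCD(x, x) is typically small but not exactly zero, and the matrix need not be perfectly symmetric.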