A new challenge for compression algorithms: genetic sequences
Information Processing and Management: an International Journal - Special issue: data compression
An introduction to Kolmogorov complexity and its applications (2nd ed.)
Towards parameter-free data mining
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
On Complexity Measures for Biological Sequences
CSB '04 Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
Substring compression problems
SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Fast search in DNA sequence databases using punctuation and indexing
ACST'06 Proceedings of the 2nd IASTED international conference on Advances in computer science and technology
International Journal of Bioinformatics Research and Applications
Structural Entropic Difference: A Bounded Distance Metric for Unordered Trees
SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications
A Distance Measure for Genome Phylogenetic Analysis
AI '09 Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence
Causal inference using the algorithmic Markov condition
IEEE Transactions on Information Theory
Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
An extended assessment of type-3 clones as detected by state-of-the-art tools
Software Quality Control
A new method to construct phylogenetic tree from proteins
SMO'05 Proceedings of the 5th WSEAS international conference on Simulation, modelling and optimization
Searching a pattern in compressed DNA sequences
International Journal of Bioinformatics Research and Applications
Iterative Dictionary Construction for Compression of Large DNA Data Sets
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Similarity of objects and the meaning of words
TAMC'06 Proceedings of the Third international conference on Theory and Applications of Models of Computation
Information theoretic approaches to whole genome phylogenies
RECOMB'05 Proceedings of the 9th Annual international conference on Research in Computational Molecular Biology
Biological networks: comparison, conservation, and evolutionary trees
RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Revisiting bounded context block-sorting transformations
Software—Practice & Experience
ACM Computing Surveys (CSUR)
Optimized relative Lempel-Ziv compression of genomes
ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113
We present a lossless compression algorithm for DNA sequences, GenCompress, based on searching for approximate repeats. Our algorithm achieves the best compression ratios on benchmark DNA sequences compared to other DNA compression programs [3, 7]. The significantly better compression results show that approximate repeats are one of the main hidden regularities in DNA sequences.

We then describe a theory of measuring the relatedness between two DNA sequences. We propose to use d(x, y) = 1 - (K(x) - K(x|y)) / K(xy) to measure the distance between any two sequences, where K stands for Kolmogorov complexity [5]. Here, K(x) - K(x|y) is the mutual information shared by x and y. Mutual information by itself is not a distance, because it does not satisfy the triangle inequality. The distance d(x, y), in contrast, is symmetric, satisfies the triangle inequality, and is furthermore universal [4].

It has not escaped our notice that the distance measure we have postulated can be immediately used to construct evolutionary trees from DNA sequences, especially those that cannot be aligned, such as complete genomes. With more and more genomes sequenced, constructing trees from genomes becomes possible [1, 2, 6, 8]. Because Kolmogorov complexity is not computable, we use GenCompress to approximate it. We present strong experimental support for this theory, and demonstrate its applicability by correctly constructing a 16S (18S) rRNA tree and a whole genome tree for several species of bacteria. Larger scale experiments are underway at the University of Waterloo, with very promising results.
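
To make the distance concrete, the following is a minimal sketch of how d(x, y) might be estimated with an off-the-shelf compressor. It is not GenCompress: zlib is swapped in purely for illustration, and the names compressed_size, distance, random_dna, and mutate are hypothetical. The sketch also assumes that K(x|y) can be estimated as the extra compressed bytes needed for x once y has been seen, and K(xy) as the compressed size of the concatenation.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input; a crude stand-in for K."""
    return len(zlib.compress(data, 9))


def distance(x: bytes, y: bytes) -> float:
    """Estimate d(x, y) = 1 - (K(x) - K(x|y)) / K(xy) with a real compressor."""
    k_x = compressed_size(x)
    k_xy = compressed_size(x + y)
    # Assumption: K(x|y) is approximated by the extra cost of x after y.
    k_x_given_y = compressed_size(y + x) - compressed_size(y)
    return 1.0 - (k_x - k_x_given_y) / k_xy


def random_dna(length: int, rng: random.Random) -> bytes:
    """Synthetic sequence over A, C, G, T (illustrative data only)."""
    return bytes(rng.choice(b"ACGT") for _ in range(length))


def mutate(seq: bytes, rate: float, rng: random.Random) -> bytes:
    """Redraw each base at random with the given probability."""
    return bytes(
        rng.choice(b"ACGT") if rng.random() < rate else base for base in seq
    )


if __name__ == "__main__":
    rng = random.Random(0)
    ancestor = random_dna(4000, rng)
    relative = mutate(ancestor, 0.01, rng)
    unrelated = random_dna(4000, rng)
    print("related:  ", distance(ancestor, relative))
    print("unrelated:", distance(ancestor, unrelated))
```

With a general-purpose compressor the estimate is only approximately symmetric and can fall slightly outside the range [0, 1]. It also only sees exact repeats within zlib's 32 KB window, so it would not be suitable for whole genomes, where a DNA-specific compressor that captures approximate repeats is needed.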