Rapid sequence homology assessment by subsampling the genome space using difference sets

Authors:
Andrzej K. Brodzik
Affiliations:
MITRE Corporation, Bedford, MA
Venue:
IEEE Transactions on Information Theory - Special issue on information theory in molecular biology and neuroscience
Year:
2010

Abstract

The availability of DNA data is growing roughly at the rate specified by Moore's law. In many molecular biology applications this data must be compared with a reference sequence, either to establish similarity of genomes or to identify functionally homologous subsequences. Current approaches based on pairwise sequence alignments are computationally expensive and often data-dependent. To ameliorate this problem, alternative, less complex sequence comparison schemes, designed to capture the essential features of genomes, must be explored. In this work a new sequence comparison approach, based on difference set models, is proposed. These models are conceptually appropriate, as they quantify, in a certain sense, two key genome attributes: sequence complexity and symbol repetition. Moreover, it is shown that difference sets are abundant in bacterial genomes and that they coincide with homologous sequence segments. These findings motivate the construction of compact representations of DNA sequences in the difference set space. An alignment of these representations permits computationally efficient identification of differences between the DNA sequences. To illustrate the efficacy of the difference set approach, characterization of indels in closely related Bacillus anthracis strains is performed, resulting in the discovery of two previously unreported collections of polymorphisms. In addition to these results, an open problem of extending the difference set approach to difference set and almost difference set families, for the analysis of more distant DNA sequences, is discussed.
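
For readers unfamiliar with the combinatorial object named in the abstract, the following minimal sketch checks the standard textbook definition of a cyclic (v, k, λ) difference set: a k-element subset D of the integers modulo v in which every nonzero residue arises exactly λ times as a difference of two distinct elements of D. The abstract does not say how DNA sequences are mapped onto such sets, so the sketch illustrates only the definition and not the paper's method. The function name `difference_set_lambda` is hypothetical.

```python
from collections import Counter


def difference_set_lambda(D, v):
    """Return lambda if D is a cyclic (v, k, lambda) difference set in Z_v.

    Returns None when D is not a difference set. This checks the textbook
    definition only; it is not the paper's sequence comparison procedure.
    """
    elems = sorted({x % v for x in D})
    # Count every difference a - b (mod v) over ordered pairs of distinct elements.
    counts = Counter((a - b) % v for a in elems for b in elems if a != b)
    # Every one of the v - 1 nonzero residues must occur as a difference ...
    if len(counts) != v - 1:
        return None
    # ... and all of them must occur the same number of times (lambda).
    multiplicities = set(counts.values())
    if len(multiplicities) != 1:
        return None
    return multiplicities.pop()


if __name__ == "__main__":
    print(difference_set_lambda({1, 2, 4}, 7))          # the (7, 3, 1) set
    print(difference_set_lambda({1, 3, 4, 5, 9}, 11))   # quadratic residues mod 11
    print(difference_set_lambda({0, 1, 2}, 7))          # not a difference set
```

The check is quadratic in the size of D, which is adequate for the small sets used to illustrate the definition.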