Finding similar regions in many sequences

Authors:
Ming Li;Bin Ma;Lusheng Wang
Affiliations:
Department of Computer Science, University of California, Santa Barbara, California;Department of Computer Science, University of Western Ontario, London, Ontario N6A5B7, Canada;City University of Hong Kong, Kowloon, Hong Kong
Venue:
Journal of Computer and System Sciences - STOC 1999
Year:
2002

Abstract

Algorithms for finding similar, or highly conserved, regions in a group of sequences are at the core of many molecular biology problems. Assume that we are given n DNA sequences s1, ..., sn. The Consensus Patterns problem, which has been widely studied in bioinformatics research, asks in its simplest form for a region of length L in each si and a median string s of length L such that the total Hamming distance from s to these regions is minimized. We show that the problem is NP-hard and give a polynomial time approximation scheme (PTAS) for it. We then present an efficient approximation algorithm for the Consensus Patterns problem under the original relative entropy measure. As an interesting application of our analysis, we further obtain a PTAS for a restricted (but still NP-hard) version of the important consensus alignment problem that allows at most a constant number of gaps, each of arbitrary length, in each sequence.
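
To make the Hamming-distance objective concrete, the following is a minimal Python sketch and not the PTAS described in the abstract. It evaluates the total Hamming distance of a candidate median string and runs a simple heuristic: each length-L region of the input is tried as a seed, the closest region in every sequence is selected, and the column-wise majority of those regions is taken as the median. All function names (hamming, windows, best_region, total_cost, column_majority, heuristic_consensus) are hypothetical and chosen only for this illustration; the heuristic carries no approximation guarantee claimed by the source.

```python
from typing import List, Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def windows(seq: str, length: int) -> List[str]:
    """All contiguous regions of the given length in seq."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def best_region(seq: str, median: str) -> Tuple[int, str]:
    """Distance and text of the region of seq closest to median.

    Raises ValueError if seq is shorter than median (no region exists).
    """
    return min((hamming(w, median), w) for w in windows(seq, len(median)))


def total_cost(seqs: List[str], median: str) -> int:
    """Objective value: sum over sequences of the distance from the
    median to the closest length-L region of that sequence."""
    return sum(best_region(s, median)[0] for s in seqs)


def column_majority(regions: List[str]) -> str:
    """Most frequent symbol in each column (ties broken alphabetically).

    For a fixed choice of regions, this string minimizes the total
    Hamming distance to them, since columns are independent.
    """
    return "".join(
        max(sorted(set(col)), key=col.count) for col in zip(*regions)
    )


def heuristic_consensus(seqs: List[str], length: int) -> Tuple[int, str]:
    """Illustrative heuristic (an assumption of this sketch, not the
    paper's algorithm): seed with every input region, refine once."""
    best: Optional[Tuple[int, str]] = None
    for s in seqs:
        for seed in windows(s, length):
            regions = [best_region(t, seed)[1] for t in seqs]
            median = column_majority(regions)
            cost = total_cost(seqs, median)
            if best is None or cost < best[0]:
                best = (cost, median)
    if best is None:
        raise ValueError("no region of the requested length exists")
    return best
```

For example, calling heuristic_consensus(["ACGTAC", "TTACGT", "GACGTT"], 4) finds the shared region ACGT with total distance 0, because that string occurs exactly in all three sequences. The search examines a number of seeds linear in the total input length, so it only illustrates the objective; it does not attempt the trade-off between running time and accuracy that a PTAS provides.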