Finding Significant Matches of Position Weight Matrices in Linear Time

Authors:
Cinzia Pizzi;Pasi Rastas;Esko Ukkonen
Affiliations:
University of Padova, Padova;University of Helsinki, Helsinki;University of Helsinki, Helsinki
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2011

Abstract

Position weight matrices are an important method for modeling signals or motifs in biological sequences, both DNA and protein. In this paper, we present fast algorithms for finding significant matches of such matrices. Our algorithms are online, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to weight matrix matching. Several variants of the algorithms are developed, including multiple-matrix extensions that search for several matrices in one scan through the sequence database. An experimental performance evaluation compares the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, where m is the matrix length and n is the sequence length, our solutions can be faster by a factor proportional to m. Our multiple-matrix filtration algorithm performed best in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
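
To make the brute-force O(mn) baseline mentioned in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy matrix stored as one score dictionary per column, integer scores, and a score threshold that has already been derived from the desired p-value; the function names `brute_force_matches` and `lookahead_matches` are hypothetical. The second function adds a simple early-exit bound (abandon a window once even the best possible remaining columns cannot reach the threshold), which is a generic pruning idea and only hints at the kind of filtering the paper develops.

```python
from typing import Dict, List, Tuple

# Assumed representation: pwm[j][symbol] is the score of `symbol` in column j.
PWM = List[Dict[str, float]]


def brute_force_matches(seq: str, pwm: PWM, threshold: float) -> List[Tuple[int, float]]:
    """Score every length-m window of seq against the matrix: O(mn) time."""
    m = len(pwm)
    hits = []
    for i in range(len(seq) - m + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(m))
        if score >= threshold:
            hits.append((i, score))
    return hits


def lookahead_matches(seq: str, pwm: PWM, threshold: float) -> List[Tuple[int, float]]:
    """Same output as brute_force_matches, but abandons hopeless windows early."""
    m = len(pwm)
    # best_rest[j] = maximum score obtainable from columns j..m-1.
    best_rest = [0.0] * (m + 1)
    for j in range(m - 1, -1, -1):
        best_rest[j] = best_rest[j + 1] + max(pwm[j].values())
    hits = []
    for i in range(len(seq) - m + 1):
        score = 0.0
        for j in range(m):
            score += pwm[j][seq[i + j]]
            # Even a perfect suffix cannot reach the threshold: stop this window.
            if score + best_rest[j + 1] < threshold:
                break
        else:
            # No early exit, so the bound at the last column guarantees a hit.
            hits.append((i, score))
    return hits


if __name__ == "__main__":
    # Toy 3-column matrix that favours the word "ACG" (illustrative values only).
    toy = [
        {"A": 2, "C": -1, "G": -1, "T": -1},
        {"A": -1, "C": 2, "G": -1, "T": -1},
        {"A": -1, "C": -1, "G": 2, "T": -1},
    ]
    print(brute_force_matches("TTACGACGAAG", toy, 3))
    print(lookahead_matches("TTACGACGAAG", toy, 3))
```

Both functions still take O(mn) time in the worst case; the early exit only helps when most windows fall below the threshold after a few columns. The sketch also assumes every sequence symbol appears in each column dictionary, so ambiguous symbols such as N would need separate handling.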