Masking patterns in sequences: A new class of motif discovery with don't cares

Authors:
Giovanni Battaglia;Davide Cangelosi;Roberto Grossi;Nadia Pisanti
Affiliations:
Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy;Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy;Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy;Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy
Venue:
Theoretical Computer Science
Year:
2009

Abstract

We introduce a new notion of motifs, called masks, that succinctly represents the repeated patterns for an input sequence T of n symbols drawn from an alphabet Σ. We show how to build the set of all frequent maximal masks of length L in O(2^L n) time and space in the worst case, using the Karp-Miller-Rosenberg approach. We show analytically that our algorithm performs better than the method that enumerates all (|Σ|+1)^L potential candidate patterns and checks each of them in T in constant time, after a polynomial-time preprocessing of T. Our algorithm is also cache-friendly, attaining O(2^L sort(n)) block transfers, where sort(n) is the cache complexity of sorting n items.
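
To make the objects in the abstract concrete, the following is a minimal brute-force sketch, not the Karp-Miller-Rosenberg-based algorithm of the paper. It rests on assumptions that the abstract does not spell out: a mask is taken to be a binary tuple of length L whose 1s mark positions that must match and whose 0s mark don't cares; a mask is called frequent when at least `quorum` windows of T agree on all its 1-positions; and a frequent mask is called maximal when no other frequent mask has 1s in a superset of its 1-positions. The names `is_frequent`, `frequent_maximal_masks`, and `quorum` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def is_frequent(text, mask, quorum):
    """Return True if at least `quorum` length-L windows of `text`
    agree on every position where `mask` has a 1 (assumed definition)."""
    L = len(mask)
    solid = [k for k, bit in enumerate(mask) if bit == 1]
    # Project each window onto the solid positions and count equal projections.
    counts = Counter(
        tuple(text[i + k] for k in solid)
        for i in range(len(text) - L + 1)
    )
    return any(c >= quorum for c in counts.values())


def frequent_maximal_masks(text, L, quorum):
    """Naively enumerate all 2^L masks and keep the frequent ones that are
    not dominated by another frequent mask (assumed notion of maximality)."""
    frequent = {
        m for m in product((0, 1), repeat=L) if is_frequent(text, m, quorum)
    }

    def dominated(m):
        # m is dominated if a different frequent mask has 1s wherever m does.
        return any(
            o != m and all(a <= b for a, b in zip(m, o)) for o in frequent
        )

    return sorted(m for m in frequent if not dominated(m))


# Windows of "abcabd" of length 3 are abc, bca, cab, abd; only "ab?" repeats.
print(frequent_maximal_masks("abcabd", 3, 2))  # [(1, 1, 0)]
```

This sketch spends roughly O(2^L · n · L) time on hashing window projections, so it only illustrates what is being computed. The point it shares with the abstract is that the search space has 2^L masks rather than (|Σ|+1)^L candidate patterns.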