BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences

Authors:
Claudio Felicioli;Roberto Marangoni
Affiliations:
Noname Research, Pisa;University of Pisa, Pisa and National Research Council of Italy, Pisa
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2012

Abstract

We propose BpMatch, an algorithm that works on a suitably modified suffix-tree data structure to compute efficiently the coverage of a source sequence S on a target sequence T, taking into account direct and reverse segments, possibly overlapping. To use BpMatch, the operator must define a priori the minimum segment length l and the minimum number of occurrences minRep, so that only segments longer than l and occurring more than minRep times are considered significant. BpMatch outputs the significant segments found and the computed segment-based distance. In the worst case, assuming the alphabet size d is constant, the time required by BpMatch to calculate the coverage is $O(l^2 n)$. On average, setting $l \ge 2\log_d(n)$, the time required to calculate the coverage is only $O(n)$. Thanks to the minRep parameter, BpMatch can also perform a self-covering: covering a sequence with segments taken from the sequence itself, while avoiding the trivial solution of a single segment coinciding with the whole sequence. The result of the self-covering approach is a spectral representation of the repeats contained in the sequence. BpMatch is freely available at www.sourceforge.net/projects/bpmatch/.
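To make the notion of coverage concrete, the following Python sketch may help. It is not the BpMatch implementation: it swaps the modified suffix tree for a plain hash set of fixed-length windows, so it does not reproduce the stated time bounds. It also ignores minRep and does not compute the segment-based distance. The names `suggested_min_length` and `coverage_mask` are hypothetical. The sketch assumes that "reverse" means plain string reversal (the paper may intend the reverse complement) and treats `min_len` as an inclusive minimum length.

```python
def suggested_min_length(n: int, d: int = 4) -> int:
    """Smallest integer l with l >= 2*log_d(n), i.e. d**l >= n**2.

    Integer arithmetic is used to avoid floating-point rounding.
    """
    l = 0
    while d ** l < n * n:
        l += 1
    return l


def coverage_mask(source: str, target: str, min_len: int,
                  use_reverse: bool = True) -> list[bool]:
    """Mark each position of `target` covered by some segment of length
    >= min_len that also occurs in `source` (or in its reversal).

    A position lies in such a segment exactly when it lies in a window of
    length min_len that occurs in `source`, so checking windows suffices.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")

    # All windows of length min_len from the source (direct strand).
    seeds = {source[i:i + min_len]
             for i in range(len(source) - min_len + 1)}
    if use_reverse:
        # Assumption: "reverse" = plain reversal, not reverse complement.
        rev = source[::-1]
        seeds |= {rev[i:i + min_len]
                  for i in range(len(rev) - min_len + 1)}

    covered = [False] * len(target)
    for j in range(len(target) - min_len + 1):
        if target[j:j + min_len] in seeds:
            for k in range(j, j + min_len):
                covered[k] = True
    return covered


if __name__ == "__main__":
    S = "ACGTACGT"
    T = "TTACGTGGTGCA"
    mask = coverage_mask(S, T, min_len=4)
    print(sum(mask), "of", len(T), "positions covered")  # 9 of 12
    print(suggested_min_length(10**6, d=4))              # 20
```

In this toy run, the windows TACG and ACGT match the direct source and TGCA matches the reversed source, so 9 of the 12 target positions are covered. For a DNA alphabet (d = 4) and n = 10^6, the bound $l \ge 2\log_d(n)$ gives l ≥ 20.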