Efficient representation of DNA data for pattern recognition using failure factor oracles

Authors:
Loek Cleophas;Derrick G. Kourie;Bruce W. Watson
Affiliations:
University of Pretoria, Hatfield, Pretoria, Republic of South Africa;University of Pretoria, Hatfield, Pretoria, Republic of South Africa;University of Pretoria, Hatfield, Pretoria, Republic of South Africa
Venue:
Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference
Year:
2013

Abstract

When indexing DNA sequences and matching patterns on them, it is important to represent all factors (substrings) of a sequence. One efficient, compact representation is the factor oracle (FO). Separately, any classical deterministic finite automaton (DFA) can be transformed into a so-called failure DFA (FDFA), in which a failure transition may replace multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and construct a failure factor oracle (FFO) directly from a given sequence, rather than transforming an existing automaton into an FDFA after the fact. The algorithm is suitable for long sequences. We empirically compared the resulting FFOs and FOs on number of transitions for many DNA sequences of lengths 4 to 512. FFOs showed gains of up to 10% in the total number of transitions, and a failure transition also takes up less space than a symbol transition. Preliminary measurements of sequence processing runtimes with FFOs were initially multiples of those with FOs, but partial memoization already leads to drastic improvements. Altogether the results are promising, particularly for using FFOs in (repeated) factor detection, where recognition speed may be less important than memory use.
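
To make the notion of a factor oracle and its transition count concrete, the following is a minimal Python sketch of the standard on-line factor oracle construction known from the literature. It is not the FFO construction described in the abstract, which is not reproduced here. The function names (`build_factor_oracle`, `count_transitions`, `accepts`) and the dictionary-based representation are assumptions chosen for illustration only.

```python
def build_factor_oracle(seq):
    """Build a factor oracle for seq with the standard on-line algorithm.

    Returns (trans, supply): trans[q] maps a symbol to a target state,
    and supply[q] is the supply link of state q (-1 for the start state).
    Illustrative sketch only; this is not the paper's FFO construction.
    """
    n = len(seq)
    trans = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)
    for i, symbol in enumerate(seq):
        new_state = i + 1
        trans[i][symbol] = new_state          # transition along the sequence
        k = supply[i]
        # Follow supply links, adding transitions to the new state
        # wherever the symbol has no outgoing transition yet.
        while k != -1 and symbol not in trans[k]:
            trans[k][symbol] = new_state
            k = supply[k]
        supply[new_state] = 0 if k == -1 else trans[k][symbol]
    return trans, supply


def count_transitions(trans):
    """Total number of symbol transitions in the oracle."""
    return sum(len(out) for out in trans)


def accepts(trans, word):
    """True if word can be read from the start state (every state accepts)."""
    state = 0
    for symbol in word:
        if symbol not in trans[state]:
            return False
        state = trans[state][symbol]
    return True


if __name__ == "__main__":
    dna = "ACGTACGA"
    oracle, _ = build_factor_oracle(dna)
    print(count_transitions(oracle), accepts(oracle, "GTAC"))
```

A factor oracle accepts every factor of the sequence, but it may also accept some strings that are not factors, so a positive answer from `accepts` is a necessary condition rather than a proof. For a sequence of length m, the oracle has m + 1 states and between m and 2m - 1 symbol transitions; that transition count is the quantity on which the abstract compares FOs and FFOs.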