Exact pattern searching in DNA sequence databases has applications in identifying highly conserved regulatory sequences, designing hybridization probes, and improving the performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, Compressed-Punctuated-Boyer-Moore (cp-BM), to speed up exact pattern matching in DNA sequences. cp-BM encodes each A, T, C, or G character in two bits, so that four characters fit in one 8-bit byte (4-character 8-bit, or 4C8B, compression), and adds punctuator characters that unambiguously mark the encoding frame of the compressed target sequence. This solves the misalignment problem that arises when searching for patterns under ordinary 4C8B compression. cp-BM searches DNA patterns at least 6 times faster than AGREP for pattern lengths ≥ 128, and between 2-fold and 5-fold faster than d-BM for all pattern lengths. Punctuator indexing and multiple punctuators improve cp-BM's performance further, especially for short sequences, yielding greater than 10-fold speedups over d-BM and AGREP. In addition, cp-BM outperformed BLAT for sequences of 64 or more bases, and was more than three-fold faster for 256-base sequences.
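The misalignment problem mentioned above can be illustrated with a minimal sketch of plain 4C8B packing. This is not the cp-BM implementation: the 2-bit code assignment, the zero-bit padding of a trailing partial group, and the helper names `pack_4c8b` and `frames` are all assumptions made for illustration, and punctuators are not modelled.

```python
# Hypothetical 2-bit code assignment; the abstract does not specify the mapping.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_4c8b(seq):
    """Pack four bases per byte, first base in the highest-order bits.

    A trailing group shorter than four bases is padded with zero bits
    (an assumption of this sketch, not something stated in the abstract).
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(group))  # left-align a partial final group
        out.append(byte)
    return bytes(out)


def frames(pattern):
    """Return the packed whole bytes of the pattern for each of the
    four possible frame offsets (0 to 3 leading bases skipped)."""
    result = []
    for offset in range(4):
        body = pattern[offset:]
        usable = len(body) - len(body) % 4  # keep whole bytes only
        result.append(pack_4c8b(body[:usable]))
    return result


target = "ACGTACGTACGT"
pattern = "CGTACGTA"  # occurs in target at base offset 1

# The pattern is present in the raw text ...
print(pattern in target)
# ... but its packed form is not a byte substring of the packed target,
# because the occurrence does not start on a byte boundary.
print(pack_4c8b(pattern) in pack_4c8b(target))
# Only the frame that skips three leading bases lines up with the target's bytes.
print([f in pack_4c8b(target) for f in frames(pattern)])
```

Without some marker of the encoding frame, a search over packed bytes has to consider all four alignments of the pattern; according to the abstract, the punctuator characters in cp-BM are what remove this ambiguity.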