A hash trie filter method for approximate string matching in genomic databases

Authors:
Ye-In Chang;Jiun-Rung Chen;Min-Tze Hsu
Affiliations:
Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424;Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424;Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424
Venue:
Applied Intelligence
Year:
2010



Abstract

In genomic databases, approximate string matching with k errors is often used to search genomic sequences, where the k errors may be substitutions, insertions, or deletions. In this paper, we propose a new method, the hash trie filter, to support approximate string matching in genomic databases efficiently. First, we build a hash trie in advance to index the genomic sequence stored in the database. Then, we use an efficient technique to find the ordered subpatterns in the sequence, which reduces the number of candidates by pruning implausible matching positions. Moreover, our method dynamically decides the number of ordered matching grams, which increases precision. Simulation results show that the hash trie filter outperforms the well-known (k+s) q-samples filter in response time, number of verified candidates, and precision across different query pattern lengths and error levels.
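
The filter-then-verify idea behind the abstract can be illustrated with a minimal sketch. This is not the authors' hash trie filter, whose details the abstract does not give. It assumes a plain dictionary of q-grams in place of the hash trie, and the classic pigeonhole partition in place of the ordered-gram technique: if k + 1 disjoint q-grams are taken from the pattern, any match with at most k errors must contain at least one of them unchanged. All function names are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Hash every q-gram of the text to the list of its start positions.
    (A plain dict stands in for the paper's hash trie; this is an assumption.)"""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def sellers_last_row(pattern, text):
    """Last row of Sellers' dynamic program: entry j is the smallest edit
    distance between the pattern and any substring of text ending at j
    (exclusive)."""
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere for free
    for i, pc in enumerate(pattern, 1):
        cur = [i]
        for j, tc in enumerate(text, 1):
            cur.append(min(prev[j - 1] + (pc != tc),  # match or substitution
                           prev[j] + 1,               # pattern character unmatched
                           cur[j - 1] + 1))           # extra text character
        prev = cur
    return prev


def search(text, pattern, k, index, q):
    """Return sorted end positions (exclusive) of matches with at most k errors."""
    m = len(pattern)
    if (k + 1) * q > m:
        raise ValueError("need k + 1 disjoint q-grams inside the pattern")
    ends = set()
    for piece in range(k + 1):
        off = piece * q  # offset of this q-gram inside the pattern
        for p in index.get(pattern[off:off + q], ()):
            # Filtering: only the window around an exact q-gram hit is a candidate.
            lo = max(0, p - off - k)
            hi = min(len(text), p - off + m + k)
            # Verification: run the dynamic program on the candidate window only.
            row = sellers_last_row(pattern, text[lo:hi])
            ends.update(lo + j for j, d in enumerate(row) if d <= k)
    return sorted(ends)


text = "ACGTACGTTAGCACGTTGCA"
index = build_qgram_index(text, q=4)
print(search(text, "ACGTTGCA", k=1, index=index, q=4))
```

The index is built once per sequence and reused across queries, mirroring the "in advance" step in the abstract. Fewer and smaller candidate windows mean less verification work, which is the quantity the abstract reports as the number of verified candidates.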