A hash trie filter method for approximate string matching in genomic databases

  • Authors:
  • Ye-In Chang;Jiun-Rung Chen;Min-Tze Hsu

  • Affiliations:
  • Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424;Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424;Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan 80424

  • Venue:
  • Applied Intelligence
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

In genomic databases, approximate string matching with k errors is often applied when searching genomic sequences, where k errors can be caused by substitution, insertion, or deletion operations. In this paper, we propose a new method, the hash trie filter, to efficiently support approximate string matching in genomic databases. First, we build a hash trie for indexing the genomic sequence stored in a database in advance. Then, we utilize an efficient technique to find the ordered subpatterns in the sequence, which could reduce the number of candidates by pruning some unreasonable matching positions. Moreover, our method will dynamically decide the number of ordered matching grams, resulting in the increase of precision. The simulation results show that the hash trie filter outperforms the well-known (k+s) q-samples filter in terms of the response time, the number of verified candidates, and the precision, under different lengths of the query patterns and different error levels.