A new approach for similarity queries of biological sequences in databases

  • Authors:
  • Hoong Kee Ng;Kang Ning;Hon Wai Leong

  • Affiliations:
  • Department of Computer Science, National University of Singapore, Singapore;Department of Computer Science, National University of Singapore, Singapore;Department of Computer Science, National University of Singapore, Singapore

  • Venue:
  • PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
  • Year:
  • 2007

Abstract

As biological databases grow larger, querying the biological sequences in these databases effectively has become an increasingly important issue for researchers. Few systems currently provide fast access to very large biological sequences. In this paper, we propose a new approach to similarity querying of biological sequences in databases. The general idea is to first transform the biological sequences into vectors and then into 2-D points in planes; these points are then indexed in a spatial index with self-organizing maps (SOM), and a single efficient similarity query (with multiple simultaneous input sequences) is performed using a fast algorithm, the multi-point range query (MPRQ) algorithm. This approach works well because similarity queries for multiple sequences can be performed, and their results returned, with just one MPRQ query, which saves substantial query time. We applied our method to DNA and protein sequences in databases; the results show that our algorithm is efficient in time and that its accuracy is satisfactory.
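The pipeline described in the abstract (sequence to vector, vector to 2-D point, one query over several input points) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' implementation: sequences are encoded as normalized k-mer frequency vectors, a fixed random linear projection stands in for the SOM-based mapping, and a brute-force scan stands in for the spatial index and the MPRQ algorithm. The names `kmer_vector`, `make_projection`, `to_point`, and `multi_point_range_query` are hypothetical.

```python
from itertools import product
import math
import random

ALPHABET = "ACGT"   # DNA alphabet; a protein alphabet would work the same way
K = 2               # k-mer length (assumed; the abstract gives no encoding)
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def kmer_vector(seq, k=K):
    """Encode a sequence as a normalized k-mer frequency vector."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # skip k-mers with unknown symbols
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in KMERS]


def make_projection(dim, seed=0):
    """Fixed random 2 x dim matrix; a stand-in for the SOM mapping."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]


def to_point(vec, proj):
    """Project a vector onto a 2-D point."""
    return tuple(sum(w * v for w, v in zip(row, vec)) for row in proj)


def multi_point_range_query(points, queries, radius):
    """Return ids of points within `radius` of ANY query point.

    Brute-force stand-in: one pass over the data answers all query
    points at once, instead of one range query per input sequence.
    """
    hits = set()
    for pid, p in points.items():
        if any(math.dist(p, q) <= radius for q in queries):
            hits.add(pid)
    return hits
```

A possible use, again with made-up data: build `proj = make_projection(len(KMERS))`, map each database sequence with `to_point(kmer_vector(seq), proj)` into a dictionary keyed by sequence id, map the query sequences the same way, and call `multi_point_range_query(points, queries, radius)` once. A real system would replace the linear scan with a spatial index so that the single query does not touch every point.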