Filtering bio-sequence based on sequence descriptor

Authors:
Te-Wen Hsieh;Huang-Cheng Kuo;Jen-Peng Huang
Affiliations:
Department of Computer Science and Information Engineering, National Chiayi University, Taiwan;Department of Computer Science and Information Engineering, National Chiayi University, Taiwan;Department of Information Management, Southern Taiwan University of Technology, Taiwan
Venue:
BioDM'06 Proceedings of the 2006 international conference on Data Mining for Biomedical Applications
Year:
2006

Abstract

Similarity searching in biological sequence databases has received substantial attention in the past decade, especially since the sequencing of the human genome. As database sizes continue to grow, fast similarity search is becoming an important issue. Transforming sequences into numerical vectors, called sequence descriptors, and storing them in a multidimensional data structure is a promising method for indexing bio-sequences. In this paper, we present an effective sequence transformation method, called SD (Sequence Descriptor), which uses multiple features of a sequence, including Count, RPD (Relative Position Dispersion), and APD (Absolute Position Dispersion), to represent the original sequence data. In contrast to the q-gram transformation method, SD avoids the problem of exponentially growing vector size. We also present a transformation, called ST (Segment Transformation), which recursively divides sequence data into equal-length subsequences, transforms each subsequence, and concatenates the resulting vectors. Experiments on human genome data show that our transformation method is more effective than the q-gram transformation method.
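To make the contrast in vector size concrete, the following is a minimal Python sketch, not the authors' implementation. The q-gram count vector is the standard construction (one entry per possible q-gram, so its length is the alphabet size raised to the power q). The abstract does not define RPD or APD, so `descriptor` uses placeholder dispersion features (variance of occurrence positions and variance of gaps between consecutive occurrences) that are assumptions made purely for illustration. `segment_transform`, its `depth` parameter, and the choice to halve the sequence and keep only leaf-level descriptors are likewise hypothetical.

```python
from itertools import product
from statistics import pvariance

ALPHABET = "ACGT"  # assumed DNA alphabet


def qgram_vector(seq, q):
    """Count every q-gram over ALPHABET; the result has len(ALPHABET)**q entries."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=q))}
    vec = [0] * len(index)
    for i in range(len(seq) - q + 1):
        gram = seq[i:i + q]
        if gram in index:  # skip windows with symbols outside ALPHABET
            vec[index[gram]] += 1
    return vec


def descriptor(seq):
    """Three numbers per symbol: a count and two PLACEHOLDER dispersion features.

    The dispersion formulas are illustrative stand-ins, not the paper's
    definitions of RPD and APD.
    """
    vec = []
    for sym in ALPHABET:
        pos = [i for i, c in enumerate(seq) if c == sym]
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        count = float(len(pos))
        rel_disp = pvariance(gaps) if gaps else 0.0  # stand-in for RPD
        abs_disp = pvariance(pos) if pos else 0.0    # stand-in for APD
        vec.extend([count, rel_disp, abs_disp])
    return vec


def segment_transform(seq, depth):
    """Recursively halve seq `depth` times and concatenate the leaf descriptors.

    Halves are of equal length only when len(seq) is divisible by 2**depth.
    """
    if depth == 0:
        return descriptor(seq)
    mid = len(seq) // 2
    return (segment_transform(seq[:mid], depth - 1)
            + segment_transform(seq[mid:], depth - 1))


if __name__ == "__main__":
    seq = "ACGTACGTACGTACGT"
    for q in (2, 4, 6):
        print("q-gram vector length, q =", q, ":", len(qgram_vector(seq, q)))
    print("descriptor length:", len(descriptor(seq)))
    print("segment transform length, depth 2:", len(segment_transform(seq, 2)))
```

Under these assumptions the q-gram vector grows as 4^q, while the descriptor stays at three values per symbol per segment regardless of q, which is the size advantage the abstract points to.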