Research on similarity searching in biological sequence databases has received substantial attention over the past decade, especially since the sequencing of the human genome. As database sizes continue to grow, fast similarity search is becoming increasingly important. Transforming sequences into numerical vectors, called sequence descriptors, and storing them in a multidimensional data structure is a promising method for indexing biological sequences. In this paper, we present an effective sequence transformation method, called SD (Sequence Descriptor), which uses multiple features of a sequence, including Count, RPD (Relative Position Dispersion), and APD (Absolute Position Dispersion), to represent the original sequence data. In contrast to the q-gram transformation method, SD avoids the problem of exponentially growing vector size. We also present a transformation, called ST (Segment Transformation), which recursively divides a sequence into equal-length subsequences, transforms each subsequence, and concatenates the resulting vectors. Experiments on human genome data show that our transformation method is more effective than the q-gram transformation method.
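The contrast between the two transformations can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define RPD or APD, so the descriptor below uses only the Count feature as a stand-in, and the DNA alphabet, the halving scheme in `segment_transform`, and all function names are assumptions made for illustration. The sketch shows that a q-gram frequency vector has one dimension per possible q-gram (4^q for DNA), while a per-symbol descriptor stays at a fixed size per segment.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: DNA sequences over a four-letter alphabet


def qgram_vector(seq, q):
    """Frequency vector with one dimension per possible q-gram (len(ALPHABET)**q)."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=q))}
    vec = [0] * len(index)
    for i in range(len(seq) - q + 1):
        gram = seq[i:i + q]
        if gram in index:  # skip q-grams containing symbols outside ALPHABET
            vec[index[gram]] += 1
    return vec


def count_descriptor(seq):
    """Count feature only; a stand-in for the full SD, whose RPD and APD
    features are not defined in the abstract."""
    return [seq.count(c) for c in ALPHABET]


def segment_transform(seq, depth, describe=count_descriptor):
    """Hypothetical ST sketch: halve the sequence `depth` times, describe each
    piece, and concatenate the resulting vectors in order."""
    if depth == 0 or len(seq) < 2:
        return describe(seq)
    mid = len(seq) // 2
    left = segment_transform(seq[:mid], depth - 1, describe)
    right = segment_transform(seq[mid:], depth - 1, describe)
    return left + right


if __name__ == "__main__":
    s = "ACGTAACC"
    for q in (1, 2, 3, 4):
        print(q, len(qgram_vector(s, q)))   # dimension grows as 4**q
    print(segment_transform(s, 1))          # two count vectors, concatenated
```

Under these assumptions, the descriptor length depends on the number of segments and the number of features per symbol rather than on a q-gram length, which is the property the abstract attributes to SD.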