Recent years have seen growing interest in similarity search over large collections of biological sequences. This paper presents a method for efficiently indexing DNA sequences based on q-grams, which facilitates similarity search in a DNA database and avoids a linear scan of the entire database. A two-level index, consisting of a hash table and c-trees, is built from the q-grams of the DNA sequences. These data structures allow quick detection of sequences within a certain distance of the query sequence. Experimental results show that our method detects similar regions in a DNA sequence database efficiently and with high sensitivity.
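
To illustrate the general idea of q-gram indexing with a hash table, here is a minimal sketch in Python. It is not the paper's implementation. The function names (`qgrams`, `build_index`, `candidates`), the toy database, and the `min_shared` threshold are all hypothetical, chosen only for illustration. The second level of the index (the c-trees) is omitted because the abstract does not describe its structure.

```python
from collections import defaultdict


def qgrams(seq, q):
    """Yield (position, q-gram) pairs for every overlapping q-gram in seq."""
    for i in range(len(seq) - q + 1):
        yield i, seq[i:i + q]


def build_index(database, q):
    """Map each q-gram to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos, gram in qgrams(seq, q):
            index[gram].append((seq_id, pos))
    return index


def candidates(index, query, q, min_shared):
    """Return ids of sequences with at least min_shared q-gram hits against the query.

    min_shared is an illustrative filter threshold (an assumption), not a
    value taken from the paper.
    """
    hits = defaultdict(int)
    for _, gram in qgrams(query, q):
        # Only sequences that contain this q-gram are touched; no full scan.
        for seq_id, _ in index.get(gram, ()):
            hits[seq_id] += 1
    return {seq_id for seq_id, count in hits.items() if count >= min_shared}


# Toy data, invented for this example.
database = {"s1": "ACGTACGTGA", "s2": "TTTTGGGGCC", "s3": "ACGTTCGTGA"}
index = build_index(database, q=3)
result = candidates(index, "ACGTACGT", q=3, min_shared=4)
```

Only sequences that share q-grams with the query are ever visited, which is how an index of this kind avoids a linear scan. A real system would then verify each candidate with an alignment or edit-distance computation.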