Indexing DNA sequences using q-grams

Authors:
Xia Cao;Shuai Cheng Li;Anthony K. H. Tung
Affiliations:
Department of Computer Science, National University of Singapore (all three authors)
Venue:
DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
Year:
2005

Abstract

Recent years have seen growing interest in similarity search over large collections of biological sequences. This paper presents a q-gram-based method for indexing DNA sequences efficiently, which facilitates similarity search in a DNA database and avoids a linear scan of the entire database. We propose a two-level index, consisting of a hash table and c-trees, built on the q-grams of the DNA sequences. These data structures allow quick detection of sequences within a given distance of the query sequence. Experimental results show that the method detects similar regions in a DNA sequence database efficiently and with high sensitivity.
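To make the idea of a q-gram hash table concrete, here is a minimal Python sketch of the first index level only. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not describe the c-tree level, so it is omitted, and the function names (`qgrams`, `build_index`, `candidates`) are hypothetical. The filter threshold uses the standard q-gram counting lemma for whole-sequence edit distance, which may differ from the distance measure and filter the authors actually use.

```python
from collections import defaultdict


def qgrams(seq, q):
    """Yield (position, q-gram) for every overlapping q-gram in seq."""
    for i in range(len(seq) - q + 1):
        yield i, seq[i:i + q]


def build_index(sequences, q):
    """Build a hash table mapping each q-gram to its (sequence id, position) hits."""
    index = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, gram in qgrams(seq, q):
            index[gram].append((sid, pos))
    return index


def candidates(index, query, q, k):
    """Return ids of sequences that may lie within edit distance k of query.

    Assumption (q-gram lemma): a sequence within edit distance k of the
    query shares at least len(query) - q + 1 - k*q q-grams with it.
    The count below is an upper bound on the shared q-grams, so the
    filter can admit false positives but never drops a true match.
    """
    threshold = len(query) - q + 1 - k * q
    if threshold <= 0:
        raise ValueError("filter is vacuous: use a smaller q or k")
    counts = defaultdict(int)
    for _, gram in qgrams(query, q):
        # Count each sequence once per query q-gram occurrence.
        for sid in {sid for sid, _ in index.get(gram, [])}:
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


db = ["ACGTACGT", "ACGTTCGT", "TTTTTTTT"]
idx = build_index(db, q=3)
print(candidates(idx, "ACGTACGT", q=3, k=1))  # [0, 1]
```

Only the surviving candidates would need a full alignment or distance computation, which is how an index of this kind avoids scanning the whole database.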