Database indexing for large DNA and protein sequence collections

Authors:
Ela Hunt;Malcolm P. Atkinson;Robert W. Irving
Affiliations:
Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK, e-mail: {ela,mpa,rwi}@dcs.gla.ac.uk;Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK, e-mail: {ela,mpa,rwi}@dcs.gla.ac.uk;Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK, e-mail: {ela,mpa,rwi}@dcs.gla.ac.uk
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2002

Citing 37
Cited 27

Complete inverted files for efficient text retrieval and analysis

Journal of the ACM (JACM)
Introduction to algorithms

Introduction to algorithms
Boyer-Moore approach to approximate string matching (extended abstract)

SWAT '90 Proceedings of the second Scandinavian workshop on Algorithm theory
A new approach to text searching

Communications of the ACM
Approximate string-matching with q-grams and maximal matches

Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School
Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Efficient implementation of suffix trees

Software—Practice & Experience
Genetic sequence data retrieval and manipulation based on generalized suffix trees

Genetic sequence data retrieval and manipulation based on generalized suffix trees
Fast text searching for regular expressions or automaton searching on tries

Journal of the ACM (JACM)
An orthogonally persistent Java

ACM SIGMOD Record
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Compact pat trees

Compact pat trees
q-gram based database searching using a suffix array (QUASAR)

RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
The string B-tree: a new data structure for string search in external memory and its applications

Journal of the ACM (JACM)
Efficient suffix trees on secondary storage

Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
Extracting structured motifs using a suffix tree—algorithms and application to promoter consensus identification

RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
On effective multi-dimensional indexing for strings

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Reducing the space requirement of suffix trees

Software—Practice & Experience
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Space efficient suffix trees

Journal of Algorithms
An efficient object promotion algorithm for persistent object systems

Software—Practice & Experience
Elementary Computability, Formal Languages and Automata

Elementary Computability, Formal Languages and Automata
Fully Integrated Data Environments: Persistent Programming Languages, Object Stores, and Programming Environments

Fully Integrated Data Environments: Persistent Programming Languages, Object Stores, and Programming Environments
Accelerating Protein Classification Using Suffix Trees

Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology
Efficient Index Structures for String Databases

Proceedings of the 27th International Conference on Very Large Data Bases
Providing Orthogonal Persistence for Java (Extended Abstract)

ECOOP '98 Proceedings of the 12th European Conference on Object-Oriented Programming
Factor Oracle: A New Structure for Pattern Matching

SOFSEM '99 Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics
Approximate String-Matching over Suffix Trees

CPM '93 Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching
A New Indexing Method for Approximate String Matching

CPM '99 Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching
Architecture of the PEVM: A High-Performance Orthogonally Persistent Java Virtual Machine

POS-9 Revised Papers from the 9th International Workshop on Persistent Object Systems
Overcoming the Memory Bottleneck in Suffix Tree Construction

FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
Opportunistic data structures with applications

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Proceedings of the Second International Workshop on Persistence and Java

Proceedings of the Second International Workshop on Persistence and Java
A Review of the Rationale and Architectures of PJama: a Durable, Flexible, Evolvable and Scalable Orthogonally Persistent Programming Platform

A Review of the Rationale and Architectures of PJama: a Durable, Flexible, Evolvable and Scalable Orthogonally Persistent Programming Platform
Orthogonal Persistence for the Java[tm] Platform: Specification and Rationale

Orthogonal Persistence for the Java[tm] Platform: Specification and Rationale

Constructing chromosome scale suffix trees

APBC '04 Proceedings of the second conference on Asia-Pacific bioinformatics - Volume 29
PSIST: Indexing Protein Structures Using Suffix Trees

CSB '05 Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference
An efficient approach for sequence matching in large DNA databases

Journal of Information Science
An efficient DNA sequence searching method using position specific weighting scheme

Journal of Information Science
A data structure for a sequence of string accesses in external memory

ACM Transactions on Algorithms (TALG)
Survey on index based homology search algorithms

The Journal of Supercomputing
PSIST: A scalable approach to indexing protein structures using suffix trees

Journal of Parallel and Distributed Computing
Towards Efficient Searching on the Secondary Structure of Protein Sequences

Fundamenta Informaticae - Special issue ISMIS'05
VisGenome with CartoonPlus: Supporting large scale genomic analyses via physical space deformation

Future Generation Computer Systems
High throughput and large capacity pipelined dynamic search tree on FPGA

Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays
A practical method for approximate subsequence search in DNA databases

PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Exhaustive peptide searching using relations

BNCOD'07 Proceedings of the 24th British national conference on Databases
An experimental study of compressed indexing and local alignments of DNA

COCOA'07 Proceedings of the 1st international conference on Combinatorial optimization and applications
A hash trie filter method for approximate string matching in genomic databases

Applied Intelligence
An indexing scheme for fast and accurate chemical fingerprint database searching

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
ERA: efficient serial and parallel suffix tree construction for very long strings

Proceedings of the VLDB Endowment
Search-Optimized suffix-tree storage for biological applications

HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Obtaining provably good performance from suffix trees in secondary storage

CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching
CSI: clustered segment indexing for efficient approximate searching on the secondary structure of protein sequences

ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems
A novel indexing method for efficient sequence matching in large DNA database environment

PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
On-line suffix tree construction with reduced branching

Journal of Discrete Algorithms
Information retrieval of sequential data in heterogeneous XML databases

AMR'05 Proceedings of the Third international conference on Adaptive Multimedia Retrieval: user, context, and feedback
Towards Efficient Searching on the Secondary Structure of Protein Sequences

Fundamenta Informaticae - Special issue ISMIS'05
Trying to outperform a well-known index with a sequential scan

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Efficient parallel construction of suffix trees for genomes larger than main memory

Proceedings of the 20th European MPI Users' Group Meeting
RACE: a scalable and elastic parallel system for discovering repeats in very long sequences

Proceedings of the VLDB Endowment
Efficient techniques on retrieving bio-information for active U-healthcare

Personal and Ubiquitous Computing

Abstract

Our aim is to develop new database technologies for the approximate matching of unstructured string data using indexes. We explore the potential of the suffix tree data structure in this context. We present a new method of building suffix trees that lets us build trees larger than the available RAM, which has hitherto not been possible. We show that this method performs in practice as well as the O(n) method of Ukkonen [70]. Using this method we build indexes for 200 Mb of protein and 300 Mbp of DNA, whose disk images exceed the available RAM. We show experimentally that suffix trees can be effectively used in approximate string matching with biological data. For a range of query lengths and error bounds the suffix tree reduces the size of the unoptimised O(mn) dynamic programming calculation required in the evaluation of string similarity, and the gain from indexing increases with index size. In the indexes we built this reduction is significant: less than 0.3% of the expected matrix is evaluated. We detail the requirements for further database and algorithmic research to support efficient use of large suffix indexes in biological applications.
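
For orientation, the unoptimised O(mn) dynamic programming calculation that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses unit-cost edit distance, and the function name approx_match_ends and its parameters are hypothetical. It evaluates every cell of the m-by-n matrix, which is the work a suffix tree index aims to avoid.

```python
def approx_match_ends(pattern, text, k):
    """Return the 1-based end positions in `text` where some substring
    matches `pattern` with at most `k` edits (unit-cost edit distance).

    Every one of the len(pattern) * len(text) matrix cells is evaluated;
    this is the unindexed baseline, not an index-based search.
    """
    m = len(pattern)
    # Column for the empty text prefix: matching pattern[:i] costs i deletions.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        # Row 0 stays 0 so that a match may start at any text position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                curr[i - 1] + 1,     # pattern character left unmatched
            )
        if curr[m] <= k:
            ends.append(j)
        prev = curr
    return ends


if __name__ == "__main__":
    print(approx_match_ends("ACGT", "TTACGTTT", 1))
```

Only two columns are kept in memory at a time, so space is O(m) while time remains O(mn) for a pattern of length m and a text of length n.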