An indexing scheme for fast and accurate chemical fingerprint database searching

Authors:
Zeyar Aung;See-Kiong Ng
Affiliations:
Institute for Infocomm Research, Agency for Science, Technology and Research, Connexis, Singapore;Institute for Infocomm Research, Agency for Science, Technology and Research, Connexis, Singapore
Venue:
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Year:
2010

Abstract

Rapid chemical database searching is important for drug discovery. Chemical compounds are represented as long fixed-length bit vectors called fingerprints, which record the presence or absence of particular features or substructures of the corresponding molecules. In a typical drug discovery application, several thousand query fingerprints are screened for similarity against a database of millions of fingerprints to identify suitable drug candidates. The existing methods, full database scan and range search, take a considerable amount of time for such a task. We present a new index-based search method called "ChemDex" (Chemical fingerprint inDexing) for speeding up fingerprint database search. We propose a novel chain scoring scheme that calculates the Tanimoto (Jaccard) scores of the fingerprints using an early-termination strategy. We tested the proposed method using 1,000 randomly selected query fingerprints on the NCBI PubChem database, which contains about 19.5 million fingerprints. Experimental results show that, for memory-based retrieval, ChemDex is up to 109.9 times faster than the full database scan method and up to 2.1 times faster than the state-of-the-art range search method. For disk-based retrieval, it is up to 145.7 times and 1.7 times faster than the full scan and the range search, respectively. The speedup is achieved without any loss of accuracy: ChemDex generates exactly the same results as the full scan and the range search.
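
To make the Tanimoto (Jaccard) score and the idea of skipping work without losing accuracy concrete, here is a minimal sketch in Python. It is not the ChemDex chain scoring scheme, whose details the abstract does not give. Instead it swaps in a well-known popcount bound: the Tanimoto score of two fingerprints can never exceed the ratio of the smaller bit count to the larger one, so a candidate whose bound falls below the threshold can be rejected before the full score is computed. The function names, the use of Python integers as bit vectors, and the convention that two all-zero fingerprints score 0.0 are assumptions made for illustration only.

```python
from typing import List, Tuple


def popcount(bits: int) -> int:
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto (Jaccard) score: |a AND b| / |a OR b|."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # assumed convention for two all-zero fingerprints
    return popcount(a & b) / union


def full_scan(query: int, database: List[int], threshold: float) -> List[Tuple[int, float]]:
    """Baseline: score every fingerprint in the database."""
    hits = []
    for idx, fp in enumerate(database):
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((idx, score))
    return hits


def search_with_bound(query: int, database: List[int], threshold: float) -> List[Tuple[int, float]]:
    """Same results as full_scan, but rejects candidates early when the
    popcount upper bound min(|q|, |f|) / max(|q|, |f|) is below the threshold."""
    q_bits = popcount(query)
    hits = []
    for idx, fp in enumerate(database):
        f_bits = popcount(fp)
        lo, hi = min(q_bits, f_bits), max(q_bits, f_bits)
        if hi > 0 and lo / hi < threshold:
            continue  # upper bound already below threshold: skip full scoring
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((idx, score))
    return hits


if __name__ == "__main__":
    # Toy 8-bit fingerprints (hypothetical data, not from PubChem).
    db = [0b1111, 0b1100, 0b0001, 0b11111111]
    query = 0b1110
    print(search_with_bound(query, db, 0.7))
```

Because the bound is a true upper limit on the score, the pruned search returns exactly the hits of the full scan, which mirrors the abstract's claim that the speedup comes without loss of accuracy. A real system would store fingerprints in packed arrays and use hardware popcount rather than string conversion.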