Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraints

Authors:
Yasuo Tabei;Akihiro Kishimoto;Masaaki Kotera;Yoshihiro Yamanishi
Affiliations:
Japan Science and Technology Agency, Kawaguchi, Japan;IBM, Dublin, Ireland;Kyoto University, Uji, Japan;Kyushu University, Fukuoka, Japan
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Abstract

Analyzing functional interactions between small compounds and proteins is indispensable in genomic drug discovery. Recent molecular databases hold rich information on a wide variety of compound-protein interactions, and making the best use of them calls for powerful methods that can find new functional compound-protein pairs on a large scale. We present the succinct interval-splitting tree algorithm (SITA), which efficiently performs similarity search in databases of compound-protein pairs with respect to both binary fingerprints and real-valued properties. SITA achieves both time and space efficiency in two ways: it introduces a data structure called the interval-splitting tree, which makes it possible to prune useless portions of the search space efficiently, and it incorporates the ideas behind the wavelet tree, a succinct data structure for representing trees compactly. We experimentally test SITA's ability to retrieve compound-protein pairs or substrate-product pairs similar to a query from large databases containing over 200 million such pairs, and show that SITA performs better than other possible approaches.
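
To make the query type concrete, the following is a minimal brute-force sketch of a similarity search with property constraints. It is not SITA and uses no interval-splitting tree or wavelet tree; it only illustrates the kind of linear scan that an index of this sort is meant to avoid. Several things here are assumptions not stated in the abstract: fingerprints are modeled as sets of "on" bit positions, similarity is taken to be the Jaccard (Tanimoto) coefficient, and the property constraint is taken to be a per-property absolute tolerance. The names `jaccard`, `constrained_search`, `sim_threshold`, and `prop_tolerance` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

# One database entry: (fingerprint as a set of "on" bit positions,
#                      tuple of real-valued properties).
Entry = Tuple[Set[int], Sequence[float]]


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard (Tanimoto) similarity of two binary fingerprints (assumed measure)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def constrained_search(
    query_fp: Set[int],
    query_props: Sequence[float],
    database: List[Entry],
    sim_threshold: float,
    prop_tolerance: float,
) -> List[Tuple[int, float]]:
    """Linear scan returning (index, similarity) for every entry that
    satisfies both the property constraint and the similarity threshold."""
    hits = []
    for idx, (fp, props) in enumerate(database):
        # Property constraint (assumed form): every property must lie
        # within prop_tolerance of the query's corresponding property.
        if any(abs(p - q) > prop_tolerance for p, q in zip(props, query_props)):
            continue
        sim = jaccard(query_fp, fp)
        if sim >= sim_threshold:
            hits.append((idx, sim))
    # Most similar entries first.
    hits.sort(key=lambda hit: -hit[1])
    return hits


# Toy data, invented purely for illustration.
database = [
    ({1, 2, 3}, (1.0,)),
    ({1, 2, 4}, (1.2,)),
    ({1, 2, 3}, (5.0,)),  # identical fingerprint, but property too far away
]
results = constrained_search({1, 2, 3}, (1.0,), database,
                             sim_threshold=0.5, prop_tolerance=0.5)
```

The scan touches every entry, so its cost grows linearly with the database size; at the scale of 200 million pairs mentioned above, that cost is the motivation for pruning portions of the search space with an index.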