Min-wise independent permutations (extended abstract)
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Succinct indexable dictionaries with applications to encoding k-ary trees and multisets
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
High-order entropy-compressed text indexes
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Practical Rank/Select Queries over Arbitrary Sequences
SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Efficient Merging and Filtering Algorithms for Approximate String Searches
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
An Introduction to Chemoinformatics
An Introduction to Chemoinformatics
Efficient Set Similarity Joins Using Min-prefixes
ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
Proceedings of the 19th international conference on World wide web
Broadword implementation of rank/select queries
WEA'08 Proceedings of the 7th international conference on Experimental algorithms
An indexing scheme for fast and accurate chemical fingerprint database searching
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Theory and applications of b-bit minwise hashing
Communications of the ACM
Efficient similarity joins for near-duplicate detection
ACM Transactions on Database Systems (TODS)
WABI'12 Proceedings of the 12th international conference on Algorithms in Bioinformatics
Hi-index | 0.00 |
Analyzing functional interactions between small compounds and proteins is indispensable in genomic drug discovery. Since rich information on various compound-protein inter- actions is available in recent molecular databases, strong demands for making best use of such databases require to in- vent powerful methods to help us find new functional compound-protein pairs on a large scale. We present the succinct interval-splitting tree algorithm (SITA) that efficiently per- forms similarity search in databases for compound-protein pairs with respect to both binary fingerprints and real-valued properties. SITA achieves both time and space efficiency by developing the data structure called interval-splitting trees, which enables to efficiently prune the useless portions of search space, and by incorporating the ideas behind wavelet tree, a succinct data structure to compactly represent trees. We experimentally test SITA on the ability to retrieve similar compound-protein pairs/substrate-product pairs for a query from large databases with over 200 million compound- protein pairs/substrate-product pairs and show that SITA performs better than other possible approaches.