Gene context analysis determines the function of genes by examining the conservation of chromosomal gene clusters and co-occurrence functional profiles across genomes. This is based on the observation that functionally related genes are often collocated on chromosomes as part of so-called "gene cassettes", and relies on the identification of such cassettes across a statistically significant and phylogenetically diverse collection of genomes. Gene context analysis is an important part of a genomic data management system such as the Integrated Microbial Genomes (IMG) system, which has one of the largest public genome collections. As of January 2013, IMG contains 3.3 million gene cassettes across 8,000 genomes. A gene context analysis in IMG performs many millions of comparisons among the cassettes and their functions. Using a traditional relational database management system, these cassettes and their functional characteristics are represented by a correlation table of more than 2 billion rows along with a dozen auxiliary tables. This correlation table requires 16.5 hours to build and a typical query requires 5 to 10 minutes to answer. We developed an alternative approach that encodes the cassettes and their functions using bitmaps. Reading the input data now takes about 1.5 hours and constructing the bitmap representations takes only 8 minutes. Together, this amounts to less than one tenth of the time needed to build the correlation table. Furthermore, fairly complex queries can now be answered in seconds. In this work, we considered three basic forms of queries required to support gene context analysis and devised two different bitmap representations to answer such queries. These basic queries can be answered in less than a second. A more complex query, which we refer to as a "killer query", requires the examination of multi-way cross-products of all cassettes. We developed a progressive pruning strategy that effectively reduces the number of possible combinations examined.
Tests have shown that we can now answer "killer queries" in seconds. Even with an extremely complex "killer query" involving 161 genomes (needing a 161-way cross-product), our algorithm took less than 10 seconds. A query involving this many genomes is expected to take so much time using a traditional DBMS that it has never been attempted before. Working with the IMG developers, we have verified our implementation and have integrated it into the production version of IMG.
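The abstract does not spell out the two bitmap representations or the pruning rule, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes one bitmap per function (bit i set when cassette i carries that function) for co-occurrence counts, and one bitmap per cassette (bits over functions) for the multi-genome query. It also assumes a "killer query" asks for function sets of at least `min_shared` functions that appear together in some cassette of every genome. All names (`build_function_bitmaps`, `cooccurrence_count`, `cassette_bitmap`, `shared_function_sets`, `min_shared`) are hypothetical.

```python
from collections import defaultdict


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def build_function_bitmaps(cassettes):
    """Assumed representation 1: one bitmap per function.

    cassettes is a list of sets of function IDs; bit i of a function's
    bitmap is set when cassette i contains that function.
    """
    bitmaps = defaultdict(int)
    for idx, functions in enumerate(cassettes):
        for f in functions:
            bitmaps[f] |= 1 << idx
    return dict(bitmaps)


def cooccurrence_count(bitmaps, f1, f2):
    """Count cassettes that contain both functions: bitwise AND, then popcount."""
    return popcount(bitmaps.get(f1, 0) & bitmaps.get(f2, 0))


def cassette_bitmap(functions, function_index):
    """Assumed representation 2: one bitmap per cassette, bits over functions."""
    bits = 0
    for f in functions:
        bits |= 1 << function_index[f]
    return bits


def shared_function_sets(genomes, min_shared):
    """Hypothetical progressive pruning over a multi-way cross-product.

    genomes is a list with one entry per genome, each a list of cassette
    bitmaps. Genomes are processed one at a time; a partial combination is
    dropped as soon as its running intersection has fewer than min_shared
    functions, and identical intersections are merged, so the full
    cross-product is never materialized.
    """
    partial = {b for b in genomes[0] if popcount(b) >= min_shared}
    for cassettes in genomes[1:]:
        survivors = set()
        for p in partial:
            for c in cassettes:
                shared = p & c
                if popcount(shared) >= min_shared:
                    survivors.add(shared)
        partial = survivors
        if not partial:
            break  # nothing can survive the remaining genomes
    return partial


# Toy data (made up for illustration only).
toy_cassettes = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
toy_bitmaps = build_function_bitmaps(toy_cassettes)

toy_index = {"A": 0, "B": 1, "C": 2, "D": 3}
toy_genomes = [
    [cassette_bitmap(s, toy_index) for s in ({"A", "B", "C"}, {"D"})],
    [cassette_bitmap(s, toy_index) for s in ({"A", "B"}, {"C", "D"})],
    [cassette_bitmap(s, toy_index) for s in ({"A", "B", "D"},)],
]
toy_result = shared_function_sets(toy_genomes, min_shared=2)
```

In this sketch the work per genome is bounded by the number of surviving distinct intersections rather than by the product of all cassette counts, which is the kind of reduction a pruning strategy aims for. A production system would likely use compressed bitmaps rather than Python integers.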