Navigational queries on Web-accessible life science sources pose unique query optimization challenges. The objects in these sources are linked to objects in other sources, forming a large and complex graph, and the sources overlap in the objects they contain. Answering a query requires traversing multiple alternate paths through these sources. Each path is associated with a benefit, the cardinality of the target object set (TOS) of objects reached in the result, and with an evaluation cost of reaching the TOS. We present two dual problems of selecting the best set of paths. The first is to select a set of paths that satisfies a constraint on the evaluation cost while maximizing the benefit (the number of distinct objects in the TOS). The dual is to select a set of paths that meets a threshold on the TOS benefit at minimal evaluation cost. The two problems can be mapped to the budgeted maximum coverage problem and the maximal set cover with a threshold, respectively. To solve them, we explore several solutions, including greedy heuristics, a randomized search, and a traditional IP/LP formulation with bounds. We perform experiments on a real-world graph of life sciences objects from NCBI and report on the computational overhead of our solutions and their performance relative to the optimal solution.
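To make the two objectives concrete, the following is a minimal sketch of a cost-ratio greedy heuristic for each problem. It is an illustration under stated assumptions, not the heuristics evaluated in the paper: the function names, the `PATHS` toy data, and the representation of a path as a (cost, object set) pair are all hypothetical, and every cost is assumed to be strictly positive.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical representation: path name -> (evaluation cost, objects reached in the TOS).
Path = Tuple[float, FrozenSet[str]]

# Toy data, invented for illustration only; the paths overlap in the objects they reach.
PATHS: Dict[str, Path] = {
    "p1": (4.0, frozenset({"a", "b", "c", "d"})),
    "p2": (1.0, frozenset({"a", "b"})),
    "p3": (2.0, frozenset({"c", "d", "e"})),
    "p4": (5.0, frozenset({"e", "f", "g", "h", "i"})),
}


def _best_path(candidates: Dict[str, Path], covered: Set[str],
               max_cost: float) -> Optional[str]:
    """Return the affordable path with the most new objects per unit cost."""
    best_name, best_ratio = None, 0.0
    for name, (cost, objects) in candidates.items():
        if cost > max_cost:
            continue  # path does not fit in the remaining budget
        ratio = len(objects - covered) / cost  # only distinct new objects count
        if ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name


def greedy_budgeted_coverage(paths: Dict[str, Path],
                             budget: float) -> Tuple[List[str], Set[str]]:
    """First problem: maximize distinct objects subject to a cost budget."""
    candidates, chosen, covered = dict(paths), [], set()
    remaining = budget
    while True:
        name = _best_path(candidates, covered, remaining)
        if name is None:  # nothing affordable adds a new object
            return chosen, covered
        cost, objects = candidates.pop(name)
        chosen.append(name)
        covered |= objects
        remaining -= cost


def greedy_threshold_cover(paths: Dict[str, Path],
                           threshold: int) -> Optional[Tuple[List[str], float]]:
    """Dual problem: reach at least `threshold` distinct objects at low cost."""
    candidates, chosen, covered = dict(paths), [], set()
    total_cost = 0.0
    while len(covered) < threshold:
        name = _best_path(candidates, covered, float("inf"))
        if name is None:  # threshold cannot be reached with these paths
            return None
        cost, objects = candidates.pop(name)
        chosen.append(name)
        covered |= objects
        total_cost += cost
    return chosen, total_cost


if __name__ == "__main__":
    print(greedy_budgeted_coverage(PATHS, budget=3.0))
    print(greedy_threshold_cover(PATHS, threshold=7))
```

On the toy data, a budget of 3.0 selects `p2` and `p3` and reaches five distinct objects, while a threshold of 7 adds `p4` for a total cost of 8.0. A ratio greedy of this kind carries no constant-factor guarantee for the budgeted problem on its own, so it should be read only as a way to see how overlap between paths changes the marginal benefit of each choice.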