What's New? What's Certain? - Scoring Search Results in the Presence of Overlapping Data Sources

Authors:
Philipp Hussels;Silke Trißl;Ulf Leser
Affiliations:
Humboldt-Universität zu Berlin, Institute of Computer Sciences, Berlin, Germany;Humboldt-Universität zu Berlin, Institute of Computer Sciences, Berlin, Germany;Humboldt-Universität zu Berlin, Institute of Computer Sciences, Berlin, Germany
Venue:
DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
Year:
2007

Abstract

Data integration projects in the life sciences often gather data on a particular subject from multiple sources. Some of these sources overlap to a certain degree, so an integrated search result may be supported by one, a few, or all data sources. To reflect these differences, results should be ranked according to the number of data sources that support them. What such a ranking should look like is not obvious: either results supported by only a few sources are ranked high because the information is potentially new, or they are ranked low because the evidence supporting them is limited. We present two scoring schemes to rank search results in the integrated protein annotation database Columba. We define a surprisingness score, which prefers results supported by few sources, and a confidence score, which prefers frequently encountered information. Unlike many other scoring schemes, our proposal is purely data-driven and does not require users to specify preferences among sources. Both scores take the concrete overlaps of data sources into account and do not presume statistical independence. We show how our schemes have been implemented efficiently using SQL.
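
The two opposing ranking directions described in the abstract can be illustrated with a minimal Python sketch. It is not the scoring scheme from the paper: the abstract gives no formulas, so the code ranks results by a plain count of supporting sources and, as a rough stand-in for "taking concrete overlaps into account", reports how often each exact combination of sources occurs in the data. The toy data, the source names A, B and C, and all function names are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy data: result id -> set of sources that contain the result.
SUPPORT = {
    "r1": frozenset({"A", "B", "C"}),
    "r2": frozenset({"A", "B"}),
    "r3": frozenset({"A", "B"}),
    "r4": frozenset({"C"}),
}


def support_counts(support):
    """Return the number of supporting sources for each result."""
    return {rid: len(sources) for rid, sources in support.items()}


def rank(support, prefer_few):
    """Rank result ids by support count.

    prefer_few=True puts rarely supported results first (a stand-in for
    surprisingness); prefer_few=False puts widely supported results first
    (a stand-in for confidence). Ties are broken by result id.
    """
    counts = support_counts(support)
    sign = 1 if prefer_few else -1
    return sorted(counts, key=lambda rid: (sign * counts[rid], rid))


def pattern_frequency(support):
    """Empirical share of results that have exactly each source combination.

    The shares come straight from the observed data, so no statistical
    independence between sources is assumed.
    """
    patterns = Counter(support.values())
    total = len(support)
    return {pattern: n / total for pattern, n in patterns.items()}


surprising_first = rank(SUPPORT, prefer_few=True)
confident_first = rank(SUPPORT, prefer_few=False)
freq = pattern_frequency(SUPPORT)
```

In this toy data set, sources A and B always occur together, so a result found in both offers less independent evidence than the raw count of two suggests. Observed combination frequencies are one simple way to expose that kind of overlap.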