Efficient Computation of Statistical Significance of Query Results in Databases

Authors:
Vishwakarma Singh;Arnab Bhattacharya;Ambuj K. Singh
Affiliations:
Department of Computer Science, University of California, Santa Barbara, USA;Department of Computer Science and Engineering, Indian Institute of Technology (I.I.T.), Kanpur, India;Department of Computer Science, University of California, Santa Barbara, USA
Venue:
SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Year:
2008

Abstract

Queries such as database similarity searches return results that satisfy certain properties of distances or scores. For domain scientists, the absolute values of these scores are seldom sufficient; the statistical significance, or p-value, of a result is a more useful criterion. The p-value can be computed using an appropriate model of random objects. Computing p-values becomes harder when queries have multiple components, because the returned score is then an aggregate of the individual component scores. The simple approach of calculating the p-value by enumerating all random possibilities fails for large databases and large queries. We propose an efficient method to calculate the approximate p-value of a multi-attribute result when the distribution of scores for the database objects is non-parametric. Experimental evaluation on large databases shows that our method is practical, runs five orders of magnitude faster than the basic approach, and has an error of less than 5% in p-value computation.
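
To make the cost of the basic approach concrete, the following is a minimal sketch of p-value computation by full enumeration. It is not the method proposed in the paper. It assumes, purely for illustration, that each query component has an independent discrete empirical score distribution, that the aggregate is a plain sum, and that higher scores are better. The function name `naive_p_value` and the toy distributions are hypothetical.

```python
from itertools import product


def naive_p_value(component_dists, observed_score):
    """Return P(aggregate score >= observed_score) by full enumeration.

    component_dists: one dict per query component, mapping score -> probability
    (an empirical, non-parametric distribution). Assumptions for this sketch:
    components are independent and the aggregate is a plain sum.
    """
    p = 0.0
    # Visit every combination of per-component scores. The number of
    # combinations is the product of the support sizes, so it grows
    # exponentially with the number of query components.
    for combo in product(*(d.items() for d in component_dists)):
        total = sum(score for score, _ in combo)
        if total >= observed_score:
            prob = 1.0
            for _, q in combo:
                prob *= q  # joint probability under independence
            p += prob
    return p


# Toy example: two components with 3 and 2 possible scores (6 combinations).
dists = [{0: 0.5, 1: 0.3, 2: 0.2}, {0: 0.6, 1: 0.4}]
p = naive_p_value(dists, 2)
```

With m possible scores per component and k components, this loop visits m^k combinations, which is why enumeration becomes impractical as databases and queries grow.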