Estimating True and False Positive Rates in Higher Dimensional Problems and Its Data Mining Applications

Authors:
Andrew Foss;Osmar R. Zaïane
Venue:
ICDMW '08 Proceedings of the 2008 IEEE International Conference on Data Mining Workshops
Year:
2008

Citing 0
Cited 1

Quantifying paedophile activity in a large P2P system

Information Processing and Management: an International Journal

Abstract

If we can estimate the accuracy of our observations, then we can estimate the true and false positive rates over a series of samples in high-dimensional data mining problems. To date, such issues have been largely neglected, and no algorithm has previously been provided to facilitate the computations involved. In high-dimensional data mining tasks, increasing sparsity leads to decreasing true positive rates. Estimating this effect makes it possible to estimate the true membership size of a class or cluster, which in turn lets us identify the top candidates for the resulting false negatives while tracking the likelihood of false positives. These estimates of true and false positive rates can also help researchers avoid unnecessary costs by collecting only as many samples as are really needed. We propose an algorithm for these computations, the Statistical Error Rate Algorithm (SERA), and give an example of its use.
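
The abstract does not describe how SERA works, so the following is not SERA. It is only a minimal sketch of the kind of calculation the abstract refers to, under a simple binomial model chosen here for illustration. The assumptions are that observations are independent, that a true member is flagged in any single sample with probability `tpr_single`, that a non-member is flagged with probability `fpr_single`, and that an item is accepted once it is flagged in at least `k` of `n` samples. All function and parameter names are hypothetical.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def rates_over_samples(n: int, k: int, tpr_single: float, fpr_single: float):
    """Aggregate (TPR, FPR) when an item must be flagged in >= k of n samples.

    Assumption: samples are independent, which real data may violate.
    """
    return binomial_tail(n, k, tpr_single), binomial_tail(n, k, fpr_single)


def min_samples(k: int, tpr_single: float, target_tpr: float, max_n: int = 1000):
    """Smallest n (up to max_n) whose aggregate TPR reaches target_tpr, else None."""
    for n in range(k, max_n + 1):
        if binomial_tail(n, k, tpr_single) >= target_tpr:
            return n
    return None


def estimated_true_size(observed: float, total: int, tpr: float, fpr: float) -> float:
    """Estimate the true class size T from the number of observed positives.

    Solves observed = tpr * T + fpr * (total - T) for T.
    """
    if tpr == fpr:
        raise ValueError("tpr and fpr must differ to identify the class size")
    return (observed - fpr * total) / (tpr - fpr)


if __name__ == "__main__":
    tpr, fpr = rates_over_samples(n=10, k=3, tpr_single=0.4, fpr_single=0.05)
    print(f"aggregate TPR={tpr:.3f}, FPR={fpr:.4f}")
    print("samples needed for TPR >= 0.95:", min_samples(3, 0.4, 0.95))
    print("estimated class size:", estimated_true_size(130, 1000, 0.6, 0.1))
```

Under these assumptions, `min_samples` illustrates the cost argument (stop collecting once the target rate is reached), and `estimated_true_size` illustrates how a lowered true positive rate can be corrected for when estimating how many members a class or cluster really has.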