A statistical methodology for analyzing co-occurrence data from a large sample

  • Authors:
  • Hui Cao; George Hripcsak; Marianthi Markatou

  • Affiliations:
  • Department of Biomedical Informatics, 622 West 168th Street, VC-5, Columbia University, New York, NY 10032, USA and Life Sciences and Health Care, Deloitte Consulting LLP, USA; Department of Biomedical Informatics, 622 West 168th Street, VC-5, Columbia University, New York, NY 10032, USA; Department of Biostatistics, Columbia University, New York, NY, USA

  • Venue:
  • Journal of Biomedical Informatics
  • Year:
  • 2007

Abstract

Determining important associations among items in a large database is challenging due to multiple simultaneous hypotheses and the ability to select weak associations that are statistically but not clinically significant. The simple application of the χ² test among all possible pairs of items results in mostly inappropriate associations surpassing the traditional (α = .05, χ² = 3.94) threshold. One can choose a stricter threshold to find stronger associations, but the choice may be arbitrary. We combined the volume test of Diaconis and Efron with a p-value plot to select a more rigorous and less arbitrary threshold. The volume test adjusts the p-value of the χ²-statistic. A plot of adjusted p-values (1 - p versus N_p), where N_p is the number of test statistics with a p-value greater than p, should be linear if there are no true associations. The point where the plot deviates from a line can be used as a threshold. We used linear regression to select the threshold in a reproducible fashion. In one experiment, we found that the method selected a threshold similar to that previously obtained by manually reviewing associations.
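
The p-value plot and regression step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the choice to fit the line only on p-values above 0.5 (`fit_above`), and the count-based `tolerance` for detecting the deviation are all assumptions made for illustration. The volume-test adjustment is not implemented, so the input is assumed to be a list of already adjusted p-values. The data are synthetic.

```python
from bisect import bisect_right


def p_value_plot(p_values):
    """Return (p, 1 - p, N_p) triples ordered by decreasing p.

    N_p is the number of p-values strictly greater than p.
    """
    ordered = sorted(p_values)
    n = len(ordered)
    triples = [(p, 1.0 - p, n - bisect_right(ordered, p)) for p in ordered]
    return triples[::-1]  # largest p first, so 1 - p increases along the list


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def select_threshold(p_values, fit_above=0.5, tolerance=5.0):
    """Return the largest p at which N_p exceeds the fitted line by more
    than `tolerance` counts, or None if the plot never leaves the line.

    Assumption: p-values above `fit_above` come mostly from pairs with no
    true association, so the line is fitted on that region only.
    """
    triples = p_value_plot(p_values)
    null_region = [(x, n_p) for p, x, n_p in triples if p > fit_above]
    slope, intercept = fit_line(null_region)
    for p, x, n_p in triples:  # walk from large p toward small p
        if n_p - (intercept + slope * x) > tolerance:
            return p
    return None


# Synthetic input: 1000 evenly spread "null" p-values plus 100 very small
# p-values standing in for true associations.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
signal_p = [(j + 1) * 1e-6 for j in range(100)]
p_values = null_p + signal_p

threshold = select_threshold(p_values)
print("selected p-value threshold:", threshold)
```

On this synthetic input the selected cutoff falls among the planted small p-values. With real data the null region is noisy, so the tolerance would need to be set relative to the scatter of the points around the fitted line.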