How accurate are the extremely small P-values used in genomic research: An evaluation of numerical libraries

Authors:
Sai Santosh Bangalore;Jelai Wang;David B. Allison
Affiliations:
The University of Alabama at Birmingham, Section on Statistical Genetics, Department of Biostatistics, RPHB 327, 1665 University Boulevard, Birmingham, AL-35294-0022, USA;The University of Alabama at Birmingham, Section on Statistical Genetics, Department of Biostatistics, RPHB 327, 1665 University Boulevard, Birmingham, AL-35294-0022, USA;The University of Alabama at Birmingham, Section on Statistical Genetics, Department of Biostatistics, RPHB 327, 1665 University Boulevard, Birmingham, AL-35294-0022, USA
Venue:
Computational Statistics & Data Analysis
Year:
2009

Abstract

In genomics and high-dimensional biology (HDB), massive multiple testing prompts the use of extremely small significance levels. Because hypothesis testing relies on tail areas of statistical distributions, the accuracy of these areas matters for making scientific judgments with confidence. Previous work on accuracy focused primarily on evaluating professionally written statistical software, such as SAS, against the Statistical Reference Datasets (StRD) provided by the National Institute of Standards and Technology (NIST), and on the accuracy of tail areas in statistical distributions. The goal of this paper is to provide guidance to investigators who are developing their own custom scientific software built upon numerical libraries written by others. Specifically, we evaluate the accuracy of small tail areas from the cumulative distribution functions (CDFs) of the chi-square and t distributions by comparing several open-source, free, or commercially licensed numerical libraries in Java, C, and R with widely accepted standards of comparison such as ELV and DCDFLIB. In our evaluation, the C libraries and R functions are consistently accurate up to six significant digits. Among the evaluated Java libraries, Colt is the most accurate. These languages and libraries are popular choices among programmers developing scientific software, so the results herein can help programmers choose libraries on the basis of CDF accuracy.
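To illustrate why small tail areas are numerically delicate, here is a minimal Python sketch. It is not the authors' evaluation procedure. It assumes the chi-square distribution with 2 degrees of freedom, whose upper tail area has the exact closed form exp(-x/2), so no external library is needed as a reference. The function names are hypothetical, and the log relative error used to count agreeing significant digits is one common accuracy measure; the abstract does not say which measure the paper uses.

```python
import math


def chi2_upper_tail_df2(x):
    """Upper tail area P(X > x) for chi-square with 2 df, via the closed form exp(-x/2)."""
    return math.exp(-x / 2.0)


def chi2_upper_tail_df2_naive(x):
    """Same tail area computed as 1 - CDF.

    For large x the CDF rounds to 1.0 in double precision, so the
    subtraction cancels and the tail area is lost.
    """
    cdf = 1.0 - math.exp(-x / 2.0)
    return 1.0 - cdf


def log_relative_error(computed, reference):
    """Approximate number of significant digits on which the two values agree."""
    if computed == reference:
        return float("inf")
    return -math.log10(abs(computed - reference) / abs(reference))


# Compare the two computations at a moderate and an extreme test statistic.
for x in (1.0, 100.0):
    reference = chi2_upper_tail_df2(x)
    naive = chi2_upper_tail_df2_naive(x)
    print(x, reference, naive, log_relative_error(naive, reference))
```

In this sketch the two computations agree closely at a moderate statistic, while at x = 100 the naive form returns zero because the tail area is far smaller than the spacing of doubles near 1. Many libraries therefore offer a dedicated upper-tail or survival function, which is worth checking for when choosing one.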