Fast algorithms for comprehensive n-point correlation estimates

  • Authors:
  • William B. March;Andrew J. Connolly;Alexander G. Gray

  • Affiliations:
  • Georgia Institute of Technology, Atlanta, GA, USA;University of Washington, Seattle, WA, USA;Georgia Institute of Technology, Atlanta, GA, USA

  • Venue:
  • Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

The n-point correlation functions (npcf) are powerful spatial statistics capable of fully characterizing any set of multidimensional points. These functions are critical in key data analyses in astronomy and materials science, among other fields, for example to test whether two point sets come from the same distribution and to validate physical models and theories. For example, the npcf has been used to study the phenomenon of dark energy, considered one of the major breakthroughs in recent scientific discoveries. Unfortunately, directly estimating the continuous npcf at a single value requires O(Nn) time for $N$ points, and n may be 2, 3, 4 or even higher, depending on the sensitivity required. In order to draw useful conclusions about real scientific problems, we must repeat this expensive computation both for many different scales in order to derive a smooth estimate and over many different subsamples of our data in order to bound the variance. We present the first comprehensive approach to the entire n-point correlation function estimation problem, including fast algorithms for the computation at multiple scales and for many subsamples. We extend the current state-of-the-art tree-based approach with these two algorithms. We show an order-of-magnitude speedup over the current best approach with each of our new algorithms and show that they can be used together to obtain over 500x speedups over the state-of-the-art in order to enable much larger datasets and more accurate scientific analyses than were possible previously.