Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling

  • Authors: Carlos Ordonez; Sasi K. Pitchaimalai

  • Affiliations: University of Houston, Houston, TX 77204, USA (both authors)

  • Venue: Data & Knowledge Engineering
  • Year: 2010

Abstract

User-Defined Functions (UDFs) are an extensibility mechanism provided by most DBMSs, and they execute in main memory. UDFs also leverage the DBMS's multi-threaded capabilities and exploit the speed and flexibility of the C language for mathematical computations. In this article, we study how to accelerate the computation of sufficient statistics on large data sets with UDFs that exploit caching and sampling techniques. We present an aggregate UDF that computes multidimensional sufficient statistics benefiting a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions. Caching can be applied when the data set fits in main memory; otherwise, sampling is required to accelerate processing of very large data sets. Sampling can also be applied to data sets that can be cached, to further accelerate processing. Experiments carefully analyze performance and accuracy with real and synthetic data sets. We compare UDFs working inside the DBMS with C++ programs reading flat files, running on the same hardware. We show that UDFs can have performance similar to C++, even when both exploit caching and multi-threading. As expected, C++ is much faster than UDFs when the data set is scanned from disk. We carefully analyze the case where sampling is required with larger data sets. We show that geometric and bootstrapping sampling techniques can be faster than performing full table scans while providing high-accuracy estimates of mean, variance and correlation. Moreover, sampling on cached data sets can provide accurate answers in a few seconds. Detailed experiments illustrate UDF optimizations including diagonal matrix computation, join avoidance and acceleration with a multi-core CPU, when available. A profile of UDF run-time execution shows that the UDF is slowed down by I/O when reading from disk.
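
To make the two statistics named in the abstract concrete, here is a minimal Python sketch. It is not the authors' C UDF. It assumes the notation n (row count), L (linear sum of points) and Q (quadratic sum of cross-products); the function names `sufficient_stats`, `mean`, `variance` and `correlation` are hypothetical and chosen only for illustration. The sketch shows that one pass over the data is enough to accumulate n, L and Q, and that mean, population variance and Pearson correlation can then be derived without touching the rows again.

```python
import math

def sufficient_stats(rows, d):
    """Accumulate n, L and Q in a single pass over d-dimensional rows."""
    n = 0
    L = [0.0] * d                       # linear sum: L[a] = sum of x[a]
    Q = [[0.0] * d for _ in range(d)]   # quadratic sum: Q[a][b] = sum of x[a]*x[b]
    for x in rows:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(a + 1):      # Q is symmetric: fill lower triangle only
                Q[a][b] += x[a] * x[b]
    for a in range(d):                  # mirror lower triangle into upper
        for b in range(a):
            Q[b][a] = Q[a][b]
    return n, L, Q

def mean(n, L):
    return [s / n for s in L]

def variance(n, L, Q):
    # Population variance per dimension: E[x^2] - E[x]^2
    return [Q[a][a] / n - (L[a] / n) ** 2 for a in range(len(L))]

def correlation(n, L, Q, a, b):
    # Pearson correlation between dimensions a and b, from n, L, Q only
    num = n * Q[a][b] - L[a] * L[b]
    den = math.sqrt(n * Q[a][a] - L[a] ** 2) * math.sqrt(n * Q[b][b] - L[b] ** 2)
    return num / den

# Tiny illustrative data set (assumed, not from the paper)
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
n, L, Q = sufficient_stats(rows, 2)
```

Because n, L and Q are plain sums, the same accumulation can be run on a sample of the rows instead of the full table, which is the setting the abstract describes for very large data sets. This sketch does not attempt the sampling schemes or the caching themselves.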