Fast PCA computation in a DBMS with aggregate UDFs and LAPACK

  • Authors:
  • Carlos Ordonez;Naveen Mohanam;Carlos Garcia-Alvarado;Predrag T. Tosic;Edgar Martinez

  • Affiliations:
  • University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA

  • Venue:
  • Proceedings of the 21st ACM international conference on Information and knowledge management
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Efficient and scalable execution of numerical methods inside a DBMS is difficult as its architecture is not suited for intense numerical computations. We study computing Principal Component Analysis (PCA) on large data sets via Singular Value Decomposition (SVD). Given the difficulty to program and optimize numerical methods on an existing DBMS, we explore an alternative reusability approach: calling the well-known numerical library LAPACK. Thus we study several alternatives to summarize the data set with aggregate User-Defined Functions (UDFs) and how to efficiently call SVD numerical methods available in LAPACK via Stored Procedures (SPs). We propose algorithmic and system optimizations to enhance scalability and to push processing into RAM. We show it is feasible to efficiently solve PCA by first summarizing the data set with arrays incrementally updated with aggregate UDFs and then pushing heavy matrix processing in SVD to RAM calling LAPACK via SPs. We benchmark our solution on a modern DBMS. Our solution requires only one pass on the data set and it exhibits linear scalability.