Fast PCA computation in a DBMS with aggregate UDFs and LAPACK

Authors:
Carlos Ordonez;Naveen Mohanam;Carlos Garcia-Alvarado;Predrag T. Tosic;Edgar Martinez
Affiliations:
University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

Efficient and scalable execution of numerical methods inside a DBMS is difficult because its architecture is not suited to intensive numerical computation. We study computing Principal Component Analysis (PCA) on large data sets via Singular Value Decomposition (SVD). Given the difficulty of programming and optimizing numerical methods on an existing DBMS, we explore an alternative based on reuse: calling the well-known numerical library LAPACK. We therefore study several alternatives for summarizing the data set with aggregate User-Defined Functions (UDFs) and how to efficiently call the SVD numerical methods available in LAPACK via Stored Procedures (SPs). We propose algorithmic and system optimizations to enhance scalability and to push processing into RAM. We show that it is feasible to solve PCA efficiently by first summarizing the data set with arrays that are incrementally updated by aggregate UDFs and then pushing the heavy matrix processing in SVD to RAM by calling LAPACK via SPs. We benchmark our solution on a modern DBMS. Our solution requires only one pass over the data set and exhibits linear scalability.
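
The one-pass summarization idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the choice of summary (the row count n, the linear sum L of the points, and the quadratic sum Q of their outer products), the class name Summary, and the accumulate/merge phases that loosely mirror how an aggregate UDF is typically structured. The abstract states that the decomposition itself is delegated to LAPACK; to keep the sketch dependency-free, a plain power iteration stands in for that call and recovers only the leading component.

```python
import math


class Summary:
    """Hypothetical aggregate state: n, L = sum of x, Q = sum of x x^T."""

    def __init__(self, d):
        self.d = d
        self.n = 0
        self.L = [0.0] * d
        self.Q = [[0.0] * d for _ in range(d)]

    def accumulate(self, x):
        # Row-at-a-time update: each input row is visited exactly once.
        self.n += 1
        for a in range(self.d):
            self.L[a] += x[a]
            for b in range(self.d):
                self.Q[a][b] += x[a] * x[b]

    def merge(self, other):
        # Combine partial states, e.g. one per parallel worker.
        self.n += other.n
        for a in range(self.d):
            self.L[a] += other.L[a]
            for b in range(self.d):
                self.Q[a][b] += other.Q[a][b]

    def covariance(self):
        # Population covariance: Q/n - (L/n)(L/n)^T, a d x d matrix.
        if self.n == 0:
            raise ValueError("no rows accumulated")
        mu = [v / self.n for v in self.L]
        return [[self.Q[a][b] / self.n - mu[a] * mu[b]
                 for b in range(self.d)] for a in range(self.d)]


def leading_component(C, iters=200):
    """Power iteration: a stand-in for the LAPACK call, leading pair only.

    Assumes C is not the zero matrix and that the all-ones start vector
    is not orthogonal to the leading eigenvector.
    """
    d = len(C)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    Cv = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * Cv[a] for a in range(d))
    return eigenvalue, v
```

The point of the design is that the summary has size d x d regardless of the number of rows, so after the single scan the remaining matrix work fits in RAM. Computing the covariance as Q/n minus the outer product of the means can lose precision when the means are large relative to the spread of the data, which a production implementation would need to consider.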