Efficient and scalable execution of numerical methods inside a DBMS is difficult because its architecture is not suited to intensive numerical computation. We study computing Principal Component Analysis (PCA) on large data sets via Singular Value Decomposition (SVD). Given the difficulty of programming and optimizing numerical methods in an existing DBMS, we explore an alternative based on reuse: calling the well-known numerical library LAPACK. We therefore study several alternatives for summarizing the data set with aggregate User-Defined Functions (UDFs) and for efficiently calling the SVD methods available in LAPACK via Stored Procedures (SPs). We propose algorithmic and system optimizations to enhance scalability and to push processing into RAM. We show that it is feasible to solve PCA efficiently by first summarizing the data set with arrays that are incrementally updated by aggregate UDFs, and then pushing the heavy matrix processing of SVD into RAM by calling LAPACK via SPs. We benchmark our solution on a modern DBMS. Our solution requires only one pass over the data set and exhibits linear scalability.
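The two-phase idea in the abstract can be illustrated outside a DBMS with a minimal Python sketch. It rests on assumptions not stated in the abstract: the summary arrays are taken to be the row count n, the linear sum L, and the sum of outer products Q. The class `PcaSummary` and the function `pca_from_summary` are hypothetical names that only mimic the accumulate and merge steps of an aggregate UDF. NumPy's `numpy.linalg.svd`, which is backed by LAPACK, stands in for the stored-procedure call.

```python
import numpy as np


class PcaSummary:
    """Hypothetical stand-in for an aggregate UDF state: n, L, Q for d dimensions."""

    def __init__(self, d):
        self.n = 0                    # number of rows seen
        self.L = np.zeros(d)          # linear sum of rows
        self.Q = np.zeros((d, d))     # sum of outer products x x^T

    def accumulate(self, x):
        """Update the summary with one row (the per-row step of an aggregate)."""
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.L += x
        self.Q += np.outer(x, x)

    def merge(self, other):
        """Combine two partial summaries (e.g. from parallel scans)."""
        self.n += other.n
        self.L += other.L
        self.Q += other.Q

    def covariance(self):
        """Population covariance derived from the summary: Q/n - mean mean^T."""
        mean = self.L / self.n
        return self.Q / self.n - np.outer(mean, mean)


def pca_from_summary(summary, k):
    """Return the top-k principal directions and variances via in-memory SVD.

    The covariance matrix is only d x d, so this step is independent of the
    number of rows; numpy.linalg.svd delegates to LAPACK.
    """
    cov = summary.covariance()
    U, s, _ = np.linalg.svd(cov)      # singular values come back in descending order
    return U[:, :k], s[:k]
```

In this sketch the data set is read exactly once, in `accumulate`, and the cost of the SVD depends only on the number of dimensions d. That is one way to see how a single pass and linear scalability in the number of rows can arise. Note that forming Q/n minus the outer product of the mean can lose precision when the means are large relative to the variances, so this is an illustration rather than a numerically hardened implementation.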