Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS, with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality but small size. First, because a DBMS limits the maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high-dimensional data sets can be analyzed with the classical PCA method, data sets that are small but high-dimensional can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these methods require many matrix operations, which are hard to express in SQL, we develop query optimizations and efficient UDFs to get good performance. Our proposed techniques achieve processing times comparable to those of the R package, a well-known statistical tool. We show experimentally that our methods scale well with high dimensionality.
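To make the numerical pipeline named above more concrete, the following is a minimal NumPy sketch of Householder tridiagonalization followed by QR iterations, used to obtain an SVD. It is not the SQL and UDF implementation described in this work. It assumes the data matrix X has n rows (samples) and d columns (genes) with n much smaller than d, so that the eigenproblem can be solved on the small n x n matrix X Xᵀ rather than on the d x d correlation matrix. It also assumes plain unshifted QR iterations with a fixed iteration count and nonzero singular values. The function names `householder_tridiagonalize`, `qr_eigen`, and `svd_via_gram` are hypothetical and chosen only for illustration.

```python
import numpy as np

def householder_tridiagonalize(A):
    """Reduce a symmetric matrix A to tridiagonal T with A = Q @ T @ Q.T."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    Q = np.eye(n)
    for k in range(n - 2):
        x = A[k + 1:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue  # column is already in tridiagonal form
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])  # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        H = np.eye(n)
        H[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)  # Householder reflector
        A = H @ A @ H
        Q = Q @ H
    return A, Q

def qr_eigen(T, iters=200):
    """Unshifted QR iterations on symmetric T; returns (eigenvalues, eigenvectors)."""
    T = np.array(T, dtype=float)
    V = np.eye(T.shape[0])
    for _ in range(iters):
        Qk, Rk = np.linalg.qr(T)
        T = Rk @ Qk      # similarity transform: Qk.T @ T @ Qk
        V = V @ Qk       # accumulate eigenvectors
    return np.diag(T).copy(), V

def svd_via_gram(X, iters=200):
    """SVD of an n x d matrix X (n << d) through the n x n matrix X @ X.T."""
    G = X @ X.T
    T, Q = householder_tridiagonalize(G)
    evals, V = qr_eigen(T, iters)
    order = np.argsort(evals)[::-1]           # largest eigenvalue first
    U = (Q @ V)[:, order]
    s = np.sqrt(np.clip(evals[order], 0.0, None))
    Vt = (X.T @ U / s).T                      # assumes nonzero singular values
    return U, s, Vt

# Synthetic 4 x 50 matrix with known singular values 8, 4, 2, 1.
rng = np.random.default_rng(0)
n, d = 4, 50
U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((d, n)))
X = U0 @ np.diag([8.0, 4.0, 2.0, 1.0]) @ V0.T

U, s, Vt = svd_via_gram(X)
print(s)  # singular values, expected to be close to 8, 4, 2, 1
```

The sketch favors readability over efficiency: it builds full reflector matrices and runs a fixed number of unshifted QR iterations, which converge slowly when eigenvalues are close together. A practical solver would use shifts and deflation.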