Microarray data analysis with PCA in a DBMS

Authors:
Waree Rinsurongkawong;Carlos Ordonez
Affiliations:
University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA
Venue:
Proceedings of the 2nd international workshop on Data and text mining in bioinformatics
Year:
2008

Abstract

Microarray data sets contain expression levels of thousands of genes. Statistical analysis of such data sets is typically performed outside a DBMS, with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality but few records. First, because a DBMS limits the maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high-dimensional data sets can be analyzed with the classical PCA method, small but high-dimensional data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to compute the SVD inside the DBMS. Since these methods require many matrix operations, which are hard to express in SQL, we develop query optimizations and efficient UDFs to get good performance. Our proposed techniques achieve processing times comparable to those of R, a well-known statistical package. We show experimentally that our methods scale well with high dimensionality.
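As an illustration of the pivoting step and of why a small, high-dimensional data set need not materialize the gene-by-gene matrix, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in DBMS. The table names `X` and `Xc`, the `(i, j, v)` layout, and the function `gram_matrix_in_sql` are assumptions made for this example and are not taken from the paper. The sketch stores n samples over d genes as one row per entry, centers each gene, and computes the n x n matrix of sample-by-sample dot products with a self-join. The nonzero eigenvalues of that n x n matrix equal those of the d x d gene-by-gene scatter matrix, so the decomposition can work on the small matrix. The Householder and QR steps themselves are not shown.

```python
import sqlite3


def gram_matrix_in_sql(rows):
    """Return the n x n matrix of centered sample-by-sample dot products.

    rows: list of n samples, each a list of d expression values.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(rows)
    conn = sqlite3.connect(":memory:")

    # Pivoted (vertical) layout: one row per (sample i, gene j) entry,
    # which sidesteps any limit on the number of columns per table.
    conn.execute(
        "CREATE TABLE X (i INTEGER, j INTEGER, v REAL, PRIMARY KEY (i, j))"
    )
    conn.executemany(
        "INSERT INTO X VALUES (?, ?, ?)",
        [(i, j, v) for i, sample in enumerate(rows) for j, v in enumerate(sample)],
    )

    # Center each gene (dimension j) by subtracting its mean over samples.
    conn.execute(
        "CREATE TABLE Xc AS "
        "SELECT X.i AS i, X.j AS j, X.v - M.mu AS v "
        "FROM X JOIN (SELECT j, AVG(v) AS mu FROM X GROUP BY j) AS M "
        "ON X.j = M.j"
    )

    # Self-join on the gene index: sums products over genes for every
    # pair of samples, giving an n x n result instead of d x d.
    cursor = conn.execute(
        "SELECT a.i, b.i, SUM(a.v * b.v) "
        "FROM Xc AS a JOIN Xc AS b ON a.j = b.j "
        "GROUP BY a.i, b.i"
    )
    gram = [[0.0] * n for _ in range(n)]
    for a, b, s in cursor:
        gram[a][b] = s
    conn.close()
    return gram


# Toy input: n = 3 samples, d = 4 genes (made-up values).
samples = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 1.0, 0.0],
    [3.0, 0.0, 2.0, 2.0],
]
G = gram_matrix_in_sql(samples)
```

With thousands of genes and only tens of samples, the join still scans every gene once per sample pair, but the result stays n x n, which is the property that makes the in-DBMS decomposition tractable in this sketch.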