Linear correlation discovery in databases: a data mining approach
Data & Knowledge Engineering
Our research takes place in a bioinformatics team embedded in a biological unit where biologists use pangenomic cDNA chips to measure the expression levels of thousands of genes at a time. The goal of our research is to systematically categorize the relations between gene expression levels (1) and biomedical values, in order to support the identification of candidate genes that allow better diagnosis of obesity and related diseases (2). A key issue in the analysis of cDNA chips is that the number of expression levels per chip is very high compared to the number of chips. We are working with 40 cDNA chips of approximately 40,000 spots each and with 2 biomedical parameters. Biologists commonly look for relationships between these types of data by computing correlations for a small number of them, selected on the basis of their biological knowledge. To go beyond such a biased, manual selection, we propose to automatically explore the combinations of all available bioclinical parameters with all gene expressions. The resulting data need to be classified to identify significant Linear Correlation Discoveries (3). Our method, DISCOCLINI, uses abstraction operators to remove outliers, approximation to define correlations, and reformulation to describe correlations and cluster them by variation patterns.
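To make the exhaustive exploration step concrete, the following minimal sketch computes a Pearson correlation coefficient for every pair of gene and bioclinical parameter across the chips. It is not the DISCOCLINI implementation: the function name `correlate_all`, the array layout (chips as rows), and the choice of plain Pearson correlation are assumptions made for illustration, and the outlier removal and clustering steps of the method are not shown.

```python
import numpy as np

def correlate_all(expr, clinical):
    """Pearson correlation of every gene with every clinical parameter.

    expr:     array of shape (n_chips, n_genes), expression levels
    clinical: array of shape (n_chips, n_params), biomedical values
    returns:  array of shape (n_genes, n_params) of correlation coefficients

    Hypothetical helper for illustration; a column that is constant
    across chips has zero norm and yields NaN.
    """
    expr = np.asarray(expr, dtype=float)
    clinical = np.asarray(clinical, dtype=float)

    # Center each column (gene or parameter) across the chips.
    e = expr - expr.mean(axis=0)
    c = clinical - clinical.mean(axis=0)

    # Scale each centered column to unit Euclidean norm.
    e /= np.linalg.norm(e, axis=0)
    c /= np.linalg.norm(c, axis=0)

    # The dot product of two centered, unit-norm columns is their Pearson r.
    return e.T @ c
```

With 40 chips, roughly 40,000 spots, and 2 parameters, the result would be a matrix of about 40,000 by 2 coefficients. Because so many coefficients come from only 40 observations, some will be large by chance alone, which is why a subsequent significance and classification step is needed.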