Journal of Biomedical Informatics
Biomedical literature databases are valuable repositories of up-to-date scientific knowledge. Efficient machine learning methods that help organize these databases and extract novel biomedical knowledge are becoming increasingly important. Several of these methods require documents to be represented as vectors of variables, which form large multivariate datasets. Because the different datasets together hold a voluminous amount of information, an open issue is how to combine the information gained from various sources into a single concise dataset that efficiently represents the corpus of documents. This paper investigates the use of a multivariate statistical approach, Non-Linear Canonical Correlation Analysis (NLCCA), to exploit the correlation among the variables of different document representations and to describe the documents with a single new dataset. Experiments with document datasets represented by text words, Medical Subject Headings (MeSH), and Gene Ontology (GO) terms showed the effectiveness of NLCCA.
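To illustrate the general idea of merging two document representations through their correlated directions, the following is a minimal sketch of ordinary linear CCA in NumPy. It is a simplified stand-in and does not implement the nonlinear variant (NLCCA) studied in the paper. The function name `linear_cca`, the regularization constant, the synthetic "text word" and "MeSH" matrices, and the choice to average the two sets of canonical scores into one dataset are all assumptions made for illustration.

```python
import numpy as np


def linear_cca(X, Y, n_components=2, reg=1e-3):
    """Linear CCA between two views of the same documents (rows aligned).

    Returns the canonical scores of each view and the canonical correlations.
    `reg` is a small ridge term (an assumption) that keeps the covariance
    matrices invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    # Whiten each view with its Cholesky factor: T = Lx^{-1} Cxy Ly^{-T}
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, T.T).T

    # Singular values of T are the canonical correlations
    U, s, Vt = np.linalg.svd(T, full_matrices=False)

    # Map the singular vectors back to weights in the original variables
    A = np.linalg.solve(Lx.T, U[:, :n_components])
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return X @ A, Y @ B, s[:n_components]


# Synthetic toy data (not from the paper): 200 documents described by two
# views that share a 2-dimensional latent structure plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
text_view = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
mesh_view = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

scores_text, scores_mesh, corr = linear_cca(text_view, mesh_view)

# One possible single "combined" dataset: the mean of the paired scores.
combined = (scores_text + scores_mesh) / 2.0
print(combined.shape, corr)
```

Averaging the paired canonical scores is only one way to obtain a single dataset; concatenating them or keeping one view's scores are equally plausible choices under these assumptions.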