Non-linear correlation of content and metadata information extracted from biomedical article datasets

Authors:
Theodosios Theodosiou;Lefteris Angelis;Athena Vakali
Affiliations:
Department of Informatics, School of Natural Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece;Department of Informatics, School of Natural Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece;Department of Informatics, School of Natural Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
Venue:
Journal of Biomedical Informatics
Year:
2008

Abstract

Biomedical literature databases are valuable repositories of up-to-date scientific knowledge. Efficient machine learning methods that facilitate the organization of these databases and the extraction of novel biomedical knowledge are becoming increasingly important. Several of these methods require documents to be represented as vectors of variables, forming large multivariate datasets. Because the information contained in the different datasets is voluminous, an open issue is how to combine information from various sources into a single concise dataset that efficiently represents the corpus of documents. This paper investigates the use of a multivariate statistical approach, Non-Linear Canonical Correlation Analysis (NLCCA), to exploit the correlation among the variables of different document representations and to describe the documents with a single new dataset. Experiments with document datasets represented by text words, Medical Subject Headings (MeSH) terms, and Gene Ontology (GO) terms showed the effectiveness of NLCCA.
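
As a rough illustration of merging two document representations through their correlation, the minimal sketch below applies ordinary linear CCA to synthetic data. It is not the NLCCA procedure used in the paper. The function name linear_cca, the regularization constant, the matrix sizes, and the random "word" and "MeSH" feature matrices are all assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def linear_cca(X, Y, n_components=2, reg=1e-6):
    """Linear CCA via whitening and SVD (a stand-in sketch, not NLCCA).

    Returns the canonical variates of X and Y and their correlations.
    """
    # Center each view so covariances are computed around the mean.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariances; a small ridge term keeps them invertible.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return (V / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the directions.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    A = Kx @ U[:, :n_components]
    B = Ky @ Vt[:n_components].T
    return X @ A, Y @ B, s[:n_components]


# Synthetic corpus (assumption): 200 documents driven by 2 hidden topics,
# observed through a 6-column "word" view and a 4-column "MeSH" view.
rng = np.random.default_rng(0)
n_docs = 200
topics = rng.normal(size=(n_docs, 2))
words = topics @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(n_docs, 6))
mesh = topics @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(n_docs, 4))

Zw, Zm, rho = linear_cca(words, mesh)

# One simple way to get a single dataset: average the paired variates.
combined = (Zw + Zm) / 2
print(combined.shape, rho)
```

Averaging the paired canonical variates is only one simple way to obtain a single combined dataset. NLCCA additionally allows for non-linear relations among the variables, which this linear sketch does not attempt to model.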