Gene annotation from scientific literature using mappings between keyword systems

  • Authors:
  • Antonio J. Pérez; Carolina Perez-Iratxeta; Peer Bork; Guillermo Thode; Miguel A. Andrade

  • Affiliations:
  • University of Málaga, Facultad de Ciencias, Departamento de Genética, Group of Bioinformatics, Campus Universitario de Teatinos, 29071 Málaga, Spain; European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany; European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany; University of Málaga, Facultad de Ciencias, Departamento de Genética, Group of Bioinformatics, Campus Universitario de Teatinos, 29071 Málaga, Spain; European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany

  • Venue:
  • Bioinformatics
  • Year:
  • 2004

Abstract

Motivation: Describing genes in databases with keywords helps non-specialists quickly grasp the properties of a gene and increases the efficiency of computational tools applied to gene data (e.g. searching a gene database for sequences related to a particular biological process). However, associating keywords with genes or protein sequences is a difficult process that ultimately requires examining the literature related to a gene.

Results: To support this task, we present a procedure to derive keywords from the set of scientific abstracts related to a gene. Our system is based on the automated extraction of mappings between related terms from different databases, using a model of fuzzy associations that is general enough to be applied to any pair of linked databases. We tested the system by annotating genes of the SWISS-PROT database with keywords derived from the abstracts linked to their entries (stored in the MEDLINE database of scientific references). The annotation procedure performed much better for SWISS-PROT keywords (recall of 47%, precision of 68%) than for Gene Ontology terms (recall of 8%, precision of 67%).

Availability: The algorithm can be publicly accessed and used for the annotation of sequences through a web server at http://www.bork.embl.de/kat
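
The idea of mapping terms between two linked keyword systems, and of scoring the result by precision and recall, can be illustrated with a small sketch. This is not the fuzzy association model of the paper, whose details are not given in the abstract; it swaps in a plain co-occurrence ratio as the association weight. The function names, the threshold value, and the toy records are all hypothetical and serve only to show the shape of the computation.

```python
from collections import defaultdict


def learn_associations(linked_records):
    """Estimate source-term -> target-keyword weights from linked entries.

    linked_records: iterable of (source_terms, target_keywords) pairs, one per
    pair of linked database entries. The weight is the fraction of records
    containing the source term that also carry the target keyword
    (an assumed stand-in for the paper's fuzzy association measure).
    """
    pair_count = defaultdict(int)
    source_count = defaultdict(int)
    for source_terms, target_keywords in linked_records:
        for s in set(source_terms):
            source_count[s] += 1
            for t in set(target_keywords):
                pair_count[(s, t)] += 1
    return {(s, t): n / source_count[s] for (s, t), n in pair_count.items()}


def annotate(source_terms, assoc, threshold=0.6):
    """Propose target keywords whose best association weight reaches threshold."""
    best = defaultdict(float)
    for (s, t), weight in assoc.items():
        if s in source_terms:
            best[t] = max(best[t], weight)
    return {t for t, weight in best.items() if weight >= threshold}


def precision_recall(predicted, expected):
    """Precision and recall of a predicted keyword set against a reference set."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Toy linked records: (terms from abstracts, keywords on the linked entry).
training = [
    ({"kinase", "phosphorylation"}, {"Transferase", "ATP-binding"}),
    ({"kinase", "membrane"}, {"Transferase", "Transmembrane"}),
    ({"membrane", "transport"}, {"Transmembrane", "Transport"}),
]

assoc = learn_associations(training)
predicted = annotate({"kinase"}, assoc)
precision, recall = precision_recall(predicted, {"Transferase", "ATP-binding"})
print(predicted, precision, recall)
```

In this toy setting, raising the threshold keeps only the strongest associations, which tends to favor precision over recall. That is one way to read the gap between the two measures reported above, although the abstract itself does not state the cause.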