AZuRE, a Scalable System for Automated Term Disambiguation of Gene and Protein Names

  • Authors:
  • Raf M. Podowski;John G. Cleary;Nicholas T. Goncharoff;Gregory Amoutzias;William S. Hayes

  • Affiliations:
  • AstraZeneca R&D Boston and Karolinska Institutet;Reel Two, Ltd. and University of Waikato;Reel Two, Inc.;AstraZeneca;AstraZeneca R&D Boston

  • Venue:
  • CSB '04 Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
  • Year:
  • 2004

Abstract

Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system is described that automatically assigns gene names to their LocusLink IDs (LLIDs) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure > 0.7, nearly 60% of which were > 0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques.