AZuRE, a Scalable System for Automated Term Disambiguation of Gene and Protein Names

Authors:
Raf M. Podowski;John G. Cleary;Nicholas T. Goncharoff;Gregory Amoutzias;William S. Hayes
Affiliations:
AstraZeneca R&D Boston and Karolinska Institutet;Reel Two, Ltd. and University of Waikato;Reel Two, Inc.;AstraZeneca;AstraZeneca R&D Boston
Venue:
CSB '04 Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
Year:
2004

Abstract

Researchers, hindered by a lack of standard gene- and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system is described that automatically assigns gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. Performance was validated for all 20,546 human genes with LLIDs. Of these, 7,344 produced good-quality models (F-measure above 0.7, nearly 60% of which were above 0.9) and 13,202 did not, mainly because too few document references were known for them. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that high-quality gene disambiguation can be achieved using scalable automated techniques.
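
As an illustration of the quality criterion mentioned in the abstract, the following minimal sketch computes an F-measure per LLID and separates models at the 0.7 cutoff. It assumes the balanced F-measure (harmonic mean of precision and recall) and a strict comparison against the cutoff; the abstract does not state either detail. The function names, the LLID keys, and the counts are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (assumed weighting): harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives means precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split_by_quality(
    counts: Dict[str, Tuple[int, int, int]], cutoff: float = 0.7
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split per-LLID models into (good, poor) by F-measure.

    `counts` maps a hypothetical LLID string to (tp, fp, fn) validation counts.
    The strict `>` comparison is an assumption about how the cutoff is applied.
    """
    good: Dict[str, float] = {}
    poor: Dict[str, float] = {}
    for llid, (tp, fp, fn) in counts.items():
        score = f_measure(tp, fp, fn)
        if score > cutoff:
            good[llid] = score
        else:
            poor[llid] = score
    return good, poor


if __name__ == "__main__":
    # Made-up validation counts for two hypothetical LLIDs.
    example = {"1001": (9, 1, 1), "1002": (1, 4, 5)}
    good, poor = split_by_quality(example)
    print("good:", good)
    print("poor:", poor)
```

A model with few known document references tends to have few true positives to validate against, which is one way the low scores described in the abstract could arise; the sketch does not model that effect.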