Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors

Authors:
Florence Horn;Anthony L. Lau;Fred E. Cohen
Affiliations:
Department of Cellular and Molecular Pharmacology;Department of Cellular and Molecular Pharmacology;Department of Cellular and Molecular Pharmacology
Venue:
Bioinformatics
Year:
2004

Abstract

Motivation: The amount of genomic and proteomic data published daily in the scientific literature is outstripping the ability of experimental scientists to stay current. Reviews, the traditional medium for collating published observations, are also unable to keep pace. For some specific classes of information (e.g. sequences and protein structures), obligatory data deposition policies have helped. However, a great deal of other valuable information is spread throughout the literature, which hinders coherent access. We are involved in the Molecular Class-Specific Information System (MCSIS) project, a collaborative effort to design and automate the maintenance of protein family databases. The first two databases, the GPCRDB and NucleaRDB, are focused on G protein-coupled receptors (GPCRs) and nuclear hormone receptors (NRs), respectively. The main aim of the MCSIS project is to gather heterogeneous data from a variety of electronic and literature sources in order to draw new inferences about the target protein families.
Results: We present a computational method that identifies and extracts mutation data from the scientific literature. We focused on the extraction of single point mutations for the GPCR and NR superfamilies. After validation by plausibility filters, the mutation data is integrated into the corresponding MCSIS, where it is combined with structural and sequence information already stored in these databases. We extracted and validated 2736 true point mutations from 914 articles on GPCRs and 785 true point mutations from 1094 articles on NRs. The current version of our automated extraction algorithm identifies 49.3% of the GPCR point mutations with a specificity of 87.9%, and 64.5% of the NR point mutations with a specificity of 85.8%. MuteXt routinely analyzes 100 electronic articles in approximately 1 h.
Availability: Extracted results are available via the GPCRDB and NucleaRDB at http://www.gpcr.org/7tm/mutation/ and http://www.receptors.org/NR/mutation/, respectively. The algorithm is available upon request.
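
Illustration: The abstract does not describe MuteXt's internals, so the following is only a minimal sketch of the general idea of extracting point mutation mentions from text and checking them against a protein sequence. The regular expression, the function names find_point_mutations and is_plausible, and the specific plausibility rule (the wild-type residue must occur at the stated position of a known sequence) are all assumptions made for this example and are not taken from the paper.

```python
import re

# Three-letter to one-letter amino acid codes (standard 20 residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Hypothetical pattern: matches mentions such as "D113A" or "Ser204Ala".
PATTERN = re.compile(
    rf"\b(?:([{ONE}])(\d+)([{ONE}])|({THREE})(\d+)({THREE}))\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    hits = []
    for m in PATTERN.finditer(text):
        if m.group(1):  # one-letter form, e.g. D113A
            wt, pos, mut = m.group(1), m.group(2), m.group(3)
        else:           # three-letter form, e.g. Ser204Ala
            wt = THREE_TO_ONE[m.group(4)]
            pos = m.group(5)
            mut = THREE_TO_ONE[m.group(6)]
        hits.append((wt, int(pos), mut))
    return hits


def is_plausible(mutation, sequence):
    """Assumed filter: the wild-type residue must sit at the stated
    1-based position of the sequence and differ from the mutant."""
    wt, pos, mut = mutation
    return wt != mut and 1 <= pos <= len(sequence) and sequence[pos - 1] == wt


# Toy input: the sentence and the sequence are invented for illustration.
text = "The D113A and Ser204Ala mutants lost binding; H2O was added."
toy_sequence = "M" * 112 + "D" + "M" * 100  # residue 113 is D

candidates = find_point_mutations(text)
validated = [c for c in candidates if is_plausible(c, toy_sequence)]
```

In this toy run, both mentions are found by the pattern, but only the one whose wild-type residue agrees with the toy sequence survives the filter. A sequence check of this kind is one simple way to discard pattern matches that are not real mutations of the protein in question.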