Motivation. Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, identifying the terms that correspond to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation.

Results. GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.

Availability. A random sample of gene/protein symbols and full names and a sample set of marked-up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/.

Contact. hy52@columbia.edu. Voice: 212-939-7028; fax: 212-666-0140.
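
The pairing of symbols with full names described under Motivation can be illustrated with a small sketch. The abstract does not say how GPmarkup finds these pairs, so the code below is not the authors' method. It assumes one common heuristic: a full name followed by its symbol in parentheses, accepted only when the initials of the preceding words spell the symbol. The names `extract_pairs`, `PAREN_SYMBOL`, and `STOPWORDS` are hypothetical and exist only for this illustration.

```python
import re

# Assumption: a symbol is a short alphanumeric token in parentheses, e.g. "(LARD)".
PAREN_SYMBOL = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

# Assumption: these function words do not contribute a letter to the symbol.
STOPWORDS = {"of", "the", "and", "in", "for"}


def extract_pairs(text):
    """Return (symbol, full_name) pairs found by the parenthetical heuristic."""
    pairs = []
    for match in PAREN_SYMBOL.finditer(text):
        symbol = match.group(1)
        # Words that precede the opening parenthesis, in reading order.
        words = re.findall(r"[A-Za-z0-9-]+", text[:match.start()])
        name_words = []
        initials = ""
        # Walk backwards, collecting one initial per non-stopword.
        for word in reversed(words):
            name_words.append(word)
            if word.lower() not in STOPWORDS:
                initials = word[0] + initials
            if len(initials) == len(symbol):
                break
        # Keep the pair only if the collected initials spell the symbol.
        if initials.lower() == symbol.lower():
            pairs.append((symbol, " ".join(reversed(name_words))))
    return pairs


sentence = "We cloned the lymphocyte associated receptor of death (LARD) gene."
print(extract_pairs(sentence))
# prints [('LARD', 'lymphocyte associated receptor of death')]
```

A heuristic of this kind misses symbols that are not simple initialisms and full names that appear without a parenthetical symbol, so it should be read only as an illustration of what a symbol and full-name pair looks like.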