Assigning roles to protein mentions: The case of transcription factors

Authors:
Hui Yang; John Keane; Casey M. Bergman; Goran Nenadic
Affiliations:
School of Computer Science, University of Manchester, UK; School of Computer Science, University of Manchester, UK; Faculty of Life Sciences, University of Manchester, UK; School of Computer Science, University of Manchester, UK
Venue:
Journal of Biomedical Informatics
Year:
2009

Abstract

Transcription factors (TFs) play a crucial role in gene regulation, and structured, curated information about them is important for genome biology. Manual curation of TF-related data is time-consuming and lags behind the knowledge available in the biomedical literature. To support the curation process, we present a machine-learning text-mining approach that identifies and tags protein mentions that play a TF role in a given context. More precisely, the method explicitly identifies those protein mentions in text that refer to their potential TF functions. Prediction features are engineered from the results of shallow parsing and domain-specific processing (recognition of relevant entities appearing in phrases), and a phrase-based Conditional Random Fields (CRF) model captures the content and context information of candidate entities. The approach was tested on a set of evidence sentences from the TRANSFAC and FlyTF databases, where it achieved an F-measure of around 51.5% with a precision of 62.5% under 5-fold cross-validation. The experimental results suggest that the phrase-based CRF model benefits from the flexibility to use correlated domain-specific features that describe the dependencies between TFs and other entities. To the best of our knowledge, this work is one of the first attempts to apply text-mining techniques to the task of assigning semantic roles to protein mentions.
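
The abstract reports an F-measure and a precision but no recall. As a minimal sketch, assuming the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall, which the abstract does not state explicitly), the implied recall can be recovered by solving the F1 formula for recall. The function names below are illustrative and do not come from the paper, and because the F-measure is given as "around 51.5%", the result is only approximate.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P.

    Assumption: the reported score is the balanced F1, not a weighted F-beta.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


if __name__ == "__main__":
    # Values quoted in the abstract, expressed as fractions.
    precision, f1 = 0.625, 0.515
    recall = implied_recall(f1, precision)
    print(f"implied recall: {recall:.3f}")
    # Sanity check: plugging the recall back in should reproduce the F1.
    print(f"recomputed F1:  {f_measure(precision, recall):.3f}")
```

Under this assumption the implied recall is roughly 44%, which suggests the model favours precision over coverage.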