A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining

Authors:
Yanpeng Li;Xiaohua Hu;Hongfei Lin;Zhihao Yang
Affiliations:
Dalian University of Technology, Dalian and Drexel University, Philadelphia;Drexel University, Philadelphia;Dalian University of Technology, Dalian;Dalian University of Technology, Dalian
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2011

Abstract

Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs that are extremely sparse in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve 89.1 F-score, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.
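
The coupling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the "coupling degree" between an EDF and a CDF is measured by pointwise mutual information (PMI) over co-occurrences in unlabeled examples, which is only one plausible choice. The function name `coupling_features` and the toy feature lists are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def coupling_features(unlabeled_docs, edfs, cdfs):
    """Map each EDF to a vector of coupling degrees, one entry per CDF.

    Assumption: coupling degree is PMI estimated from unlabeled data;
    pairs that never co-occur get 0.0. `unlabeled_docs` is an iterable
    of token sets, one set per unlabeled example.
    """
    n_docs = 0
    edf_df = Counter()  # number of examples containing each EDF
    cdf_df = Counter()  # number of examples containing each CDF
    joint = Counter()   # number of examples containing both EDF and CDF

    for tokens in unlabeled_docs:
        n_docs += 1
        present_edfs = [e for e in edfs if e in tokens]
        present_cdfs = [c for c in cdfs if c in tokens]
        edf_df.update(present_edfs)
        cdf_df.update(present_cdfs)
        joint.update(product(present_edfs, present_cdfs))

    features = {}
    for e in edfs:
        vector = []
        for c in cdfs:
            both = joint[(e, c)]
            if both == 0:
                vector.append(0.0)
            else:
                # PMI = log( P(e, c) / (P(e) * P(c)) )
                vector.append(math.log(both * n_docs / (edf_df[e] * cdf_df[c])))
        features[e] = vector
    return features


# Toy usage with hypothetical features: rare words act as EDFs,
# context words that hint at the class act as CDFs.
docs = [
    {"BRCA1", "gene", "expression"},
    {"BRCA1", "gene"},
    {"aspirin", "dose"},
    {"gene", "mutation"},
]
vectors = coupling_features(docs, edfs=["BRCA1", "aspirin"], cdfs=["gene", "dose"])
```

In this sketch, a sparse EDF such as a rarely seen word is replaced by a short dense vector of its coupling degrees with the CDFs, so a supervised learner can generalize across EDFs that behave similarly in unlabeled text even when each appears only a few times in labeled data.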