Word association norms, mutual information, and lexicography
Computational Linguistics
Combining labeled and unlabeled data with co-training
COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Unsupervised named-entity extraction from the web: an experimental study
Artificial Intelligence
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
The Journal of Machine Learning Research
Self-taught learning: transfer learning from unlabeled data
Proceedings of the 24th international conference on Machine learning
Computing semantic relatedness using Wikipedia-based explicit semantic analysis
IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Comparative experiments on learning information extractors for proteins and their interactions
Artificial Intelligence in Medicine
A rich feature vector for protein-protein interaction extraction from multiple corpora
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1
Efficiently inducing features of conditional random fields
UAI'03 Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence
Gene Classification Using Parameter-Free Semi-Supervised Manifold Learning
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Hi-index | 0.00 |
Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs) from original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is: EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB unlabeled PubMed abstracts. The experimental results on BioCreative 2, AIMED corpus, and TREC 2005 Genomics Track show that 1) FCG can utilize well the sparse features ignored by supervised learning. 2) It improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the tree tasks. 3) Our methods achieve 89.1, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.