Text mining techniques for leveraging positively labeled data

Authors:
Lana Yeganova;Donald C. Comeau;Won Kim;W. John Wilbur
Affiliations:
National Center for Biotechnology Information, NLM, NIH, Bethesda, MD;National Center for Biotechnology Information, NLM, NIH, Bethesda, MD;National Center for Biotechnology Information, NLM, NIH, Bethesda, MD;National Center for Biotechnology Information, NLM, NIH, Bethesda, MD
Venue:
BioNLP '11 Proceedings of BioNLP 2011 Workshop
Year:
2011

Abstract

Suppose we have a large collection of documents, most of which are unlabeled. Suppose further that a small subset of these documents is labeled as positive examples of a particular class we are interested in. We may have reason to believe that the large unlabeled collection contains more documents of this positive class. What data mining techniques could help us find these unlabeled positive examples? Here we examine machine learning strategies designed to solve this problem. We find that a proper choice of machine learning method and training strategy can substantially improve the retrieval, from the large collection, of data enriched with positive examples. We illustrate the principles with a real example consisting of multiword UMLS phrases among a much larger collection of phrases from MEDLINE.
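
The setting described in the abstract can be illustrated with a minimal sketch. It is not the method evaluated in the paper: it assumes one common baseline for this kind of problem, namely treating every unlabeled item as a noisy negative, fitting a simple smoothed word log-odds scorer, and ranking the unlabeled items by score so that likely positives rise to the top. The function names (`tokenize`, `train_log_odds`, `score`, `rank_unlabeled`) and the toy phrases are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(phrase):
    """Lowercase whitespace tokenization (an assumption made for this sketch)."""
    return phrase.lower().split()


def train_log_odds(positives, unlabeled):
    """Fit per-word log-odds, treating every unlabeled phrase as a noisy negative."""
    pos_counts = Counter(t for p in positives for t in tokenize(p))
    unl_counts = Counter(t for u in unlabeled for t in tokenize(u))
    vocab = set(pos_counts) | set(unl_counts)
    # Add-one smoothing so words seen in only one set keep a finite weight.
    pos_total = sum(pos_counts.values()) + len(vocab)
    unl_total = sum(unl_counts.values()) + len(vocab)
    return {
        w: math.log((pos_counts[w] + 1) / pos_total)
        - math.log((unl_counts[w] + 1) / unl_total)
        for w in vocab
    }


def score(weights, phrase):
    """Sum of word log-odds; words unseen in training contribute nothing."""
    return sum(weights.get(t, 0.0) for t in tokenize(phrase))


def rank_unlabeled(weights, unlabeled):
    """Order unlabeled phrases from most to least positive-looking."""
    return sorted(unlabeled, key=lambda u: score(weights, u), reverse=True)


# Toy data, invented for illustration only (not taken from UMLS or MEDLINE).
positives = ["heart failure", "heart attack", "lung cancer", "breast cancer"]
unlabeled = ["in this study", "heart cancer", "we found that", "of the patients"]

weights = train_log_odds(positives, unlabeled)
ranked = rank_unlabeled(weights, unlabeled)
print(ranked[0])
```

In this toy run the unlabeled phrase that shares vocabulary with the labeled positives is ranked first, even though it was counted as a negative during training. On a realistic collection the choice of learner and the handling of the class imbalance would matter far more than in this sketch.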