Learning to Find Relevant Biological Articles without Negative Training Examples

Authors:
Keith Noto;Milton H. Saier, Jr.;Charles Elkan
Affiliations:
University of California, La Jolla, CA 92093;University of California, La Jolla, CA 92093;University of California, La Jolla, CA 92093
Venue:
AI '08 Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
Year:
2008

Abstract

Classifiers are traditionally learned from sets of positive and negative training examples. Often, however, a classifier is needed when the only training data available are an incomplete set of positive examples and a set of unlabeled examples. This is the situation, for example, with the Transport Classification Database (TCDB, www.tcdb.org), a repository of information about proteins involved in transmembrane transport. This paper presents and evaluates a method for learning to rank newly published scientific articles by their likely relevance to TCDB, using the articles currently referenced in TCDB as positive training examples. The new method has identified 964 new articles relevant to TCDB in fewer than six months, which is a major practical success. From a general data mining perspective, the contributions of this paper are (i) evaluating two novel approaches that solve the positive-only problem effectively, (ii) applying support vector machines in a state-of-the-art way for recognizing and ranking relevance, and (iii) deploying a system to update a widely used, real-world biomedical database. Supplementary information, including all data sets, is publicly available at www.cs.ucsd.edu/users/knoto/pub/ajcai08.
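
The abstract does not spell out the two approaches, so the following is only a minimal sketch of one common baseline for the positive-and-unlabeled setting, not the paper's method: treat every unlabeled document as a provisional negative, fit a linear model, and rank candidates by its score. It assumes a plain bag-of-words representation and substitutes a small logistic regression for the support vector machine mentioned above so that it runs with the Python standard library alone. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real text features.
    return text.lower().split()

def train_pu_ranker(positive_docs, unlabeled_docs, epochs=50, lr=0.5):
    """Fit logistic regression with labeled positives as 1 and unlabeled docs as 0."""
    data = [(Counter(tokenize(d)), 1.0) for d in positive_docs]
    data += [(Counter(tokenize(d)), 0.0) for d in unlabeled_docs]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in data:
            z = bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())
            z = max(-30.0, min(30.0, z))  # keep exp() well inside float range
            err = label - 1.0 / (1.0 + math.exp(-z))
            bias += lr * err
            for t, c in feats.items():
                weights[t] = weights.get(t, 0.0) + lr * err * c
    return weights, bias

def score(model, doc):
    """Linear relevance score; higher means more similar to the positives."""
    weights, bias = model
    feats = Counter(tokenize(doc))
    return bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())

def rank(model, candidates):
    """Order candidate documents from most to least likely relevant."""
    return sorted(candidates, key=lambda d: score(model, d), reverse=True)

# Toy data (hypothetical). The second unlabeled document is a hidden positive,
# which is exactly what makes the positive-only setting difficult.
positives = [
    "membrane transport protein channel",
    "ion transporter membrane permease",
]
unlabeled = [
    "galaxy redshift survey telescope",
    "transport protein permease substrate",
    "stock market price index",
]
model = train_pu_ranker(positives, unlabeled)
candidates = [
    "galaxy telescope survey",
    "weather forecast tomorrow",
    "ion channel membrane transporter",
]
ranking = rank(model, candidates)
print(ranking)
```

Because hidden positives are trained as negatives, the scores from such a baseline tend to underestimate relevance in absolute terms. The ordering of candidates can still be useful, which is why the sketch ranks documents rather than thresholding them.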