Text classification from positive and unlabeled documents

Authors:
Hwanjo Yu;ChengXiang Zhai;Jiawei Han
Affiliations:
University of Illinois, IL;University of Illinois, IL;University of Illinois, IL
Venue:
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Year:
2003

Citing 17
Cited 16

Representation and learning in information retrieval

Representation and learning in information retrieval
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
A statistical learning model of text classification for support vector machines

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A study of thresholding strategies for text categorization

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Bayesian online classifiers for text classification and filtering

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Partially Supervised Classification of Text Documents

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Transductive Inference for Text Classification using Support Vector Machines

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
A Machine Learning Approach to Building Domain-Specific Search Engines

IJCAI '99 Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence
PEBL: positive example based learning for Web page classification using SVM

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
One-class svms for document classification

The Journal of Machine Learning Research
Uniform object generation for optimizing one-class classifiers

The Journal of Machine Learning Research
Training ν-Support Vector Classifiers: Theory and Algorithms

Neural Computation
New Support Vector Algorithms

Neural Computation
SVMC: single-class classification with support vector machines

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence

Automatic new topic identification using multiple linear regression

Information Processing and Management: an International Journal
A partially supervised classification approach to dominant and recessive human disease gene prediction

Computer Methods and Programs in Biomedicine
The link-prediction problem for social networks

Journal of the American Society for Information Science and Technology
Learning Bayesian classifiers from positive and unlabeled examples

Pattern Recognition Letters
Mutually beneficial learning with application to on-line news classification

Proceedings of the ACM first Ph.D. workshop in CIKM
Using the shape recovery method to evaluate indexing techniques

Journal of the American Society for Information Science and Technology
Automatic record linkage using seeded nearest neighbour and support vector machine classification

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Imbalanced text classification: A term weighting approach

Expert Systems with Applications: An International Journal
Incremental data-driven learning of a novelty detection model for one-class classification with application to high-dimensional noisy data

Machine Learning
PORE: positive-only relation extraction from wikipedia text

ISWC'07/ASWC'07 Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference
Measuring the interestingness of articles in a limited user environment

Information Processing and Management: an International Journal
Iterative extreme learning machine for single class classifier using general mapping convergence framework

ACS'06 Proceedings of the 6th WSEAS international conference on Applied computer science
A pairwise ranking based approach to learning with positive and unlabeled examples

Proceedings of the 20th ACM international conference on Information and knowledge management
Leveraging one-class SVM and semantic analysis to detect anomalous content

ISI'05 Proceedings of the 2005 IEEE international conference on Intelligence and Security Informatics
Sampling the Web as Training Data for Text Classification

International Journal of Digital Library Systems
Learning from data streams with only positive and unlabeled data

Journal of Intelligent Information Systems

Abstract

Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems are more accurately described as learning a binary classifier from incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study this problem of Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) [17], for TC-WON tasks. Our analyses show that when the positive training data are not too under-sampled, SVMC significantly outperforms other methods, because SVMC exploits the natural "gap" between positive and negative documents in the feature space, which in turn improves generalization performance. In the text domain, many such gaps are likely to exist because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the SVMC boundary starts overfitting at some point and ends up generating very poor results. This is because, when there are too few positive training examples, the boundary over-iterates and trespasses the natural gaps between the positive and negative classes in the feature space, and thus ends up fitting tightly around the few positive training examples.
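
The iterative behavior described in the abstract, a boundary that keeps absorbing unlabeled examples into the negative class until it settles in a gap, can be pictured with a small toy loop. The sketch below is not the authors' SVMC implementation. It assumes a generic "start from a few confident negatives, retrain, relabel, repeat until nothing changes" scheme, and it swaps in a nearest-centroid rule where the paper uses an SVM. The function name `mapping_convergence`, the 2-D toy points, and the hand-picked seed negatives are all hypothetical and chosen only for illustration.

```python
from math import dist


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def mapping_convergence(positives, unlabeled, seed_negatives):
    """Toy iterative loop (assumption: nearest-centroid stands in for the SVM).

    Each round, unlabeled points that lie closer to the negative centroid
    than to the positive centroid are moved into the negative set. The loop
    stops when a round moves nothing. Returns the unlabeled points that were
    never moved (treated as positive), the final negative set, and the
    number of rounds that moved at least one point.
    """
    negatives = list(seed_negatives)
    remaining = [u for u in unlabeled if u not in negatives]
    rounds = 0
    while remaining:
        c_pos, c_neg = centroid(positives), centroid(negatives)
        newly_negative = [u for u in remaining if dist(u, c_neg) < dist(u, c_pos)]
        if not newly_negative:
            break  # converged: no unlabeled point crosses to the negative side
        negatives.extend(newly_negative)
        remaining = [u for u in remaining if u not in newly_negative]
        rounds += 1
    return remaining, negatives, rounds


# Hypothetical toy data: labeled positives near the origin, and an unlabeled
# pool that mixes hidden positives with negatives separated by a wide gap.
POSITIVES = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
UNLABELED = [(0.5, 0.5), (1.0, 0.5), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0),
             (9.0, 9.0), (10.0, 9.0)]
SEED_NEGATIVES = [(9.0, 9.0), (10.0, 9.0)]  # assumed "confident" negatives

if __name__ == "__main__":
    kept, negs, rounds = mapping_convergence(POSITIVES, UNLABELED, SEED_NEGATIVES)
    print("treated as positive:", kept)
    print("treated as negative:", negs)
    print("rounds:", rounds)
```

On this toy data the loop stops once the negative set has grown up to the gap between the two clusters. A nearest-centroid rule is too simple to show the overfitting that the abstract reports for very small positive sets, so the sketch illustrates only the convergence idea.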