Learning to classify text from labeled and unlabeled documents

Authors:
Kamal Nigam;Andrew McCallum;Sebastian Thrun;Tom Mitchell
Venue:
AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Year:
1998

Abstract

In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents; it then trains a new classifier using the labels for all the documents, and iterates to convergence. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 33%.
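
The loop described in the abstract (train on labeled documents, probabilistically label the unlabeled ones, retrain on everything, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a multinomial naive Bayes model over word counts with Laplace smoothing, and it stops when the soft labels stop changing, which is a convenience choice made here. The function names (`train_nb`, `predict_proba`, `em_naive_bayes`), the smoothing constant `alpha`, and the toy data are all hypothetical.

```python
import numpy as np

def train_nb(X, resp, alpha=1.0):
    """M-step: fit multinomial naive Bayes from (possibly soft) labels.

    X:    (n_docs, n_words) word-count matrix
    resp: (n_docs, n_classes) class-membership probabilities per document
    alpha is an assumed Laplace smoothing constant.
    """
    n_classes, n_words = resp.shape[1], X.shape[1]
    class_mass = resp.sum(axis=0)
    log_prior = np.log((class_mass + alpha) / (class_mass.sum() + alpha * n_classes))
    word_mass = resp.T @ X  # expected word counts per class
    log_word = np.log((word_mass + alpha) /
                      (word_mass.sum(axis=1, keepdims=True) + alpha * n_words))
    return log_prior, log_word

def predict_proba(X, log_prior, log_word):
    """E-step: posterior class probabilities for each document."""
    log_joint = X @ log_word.T + log_prior
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)  # stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=20, tol=1e-6):
    """Semi-supervised EM: labeled documents keep their given labels."""
    resp_lab = np.eye(n_classes)[np.asarray(y_lab)]  # one-hot, held fixed
    params = train_nb(X_lab, resp_lab)               # initial classifier
    X_all = np.vstack([X_lab, X_unl])
    prev = None
    for _ in range(n_iter):
        resp_unl = predict_proba(X_unl, *params)     # probabilistic labels
        params = train_nb(X_all, np.vstack([resp_lab, resp_unl]))
        # Assumed stopping rule: soft labels have stopped changing.
        if prev is not None and np.abs(resp_unl - prev).max() < tol:
            break
        prev = resp_unl
    return params

# Toy data (made up): words 0-1 are typical of class 0, words 2-3 of class 1.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 3]], dtype=float)
y_lab = [0, 1]
X_unl = np.array([[2, 2, 0, 0], [0, 1, 3, 2], [4, 0, 0, 1], [0, 0, 1, 4]], dtype=float)

log_prior, log_word = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)
X_new = np.array([[5, 0, 0, 0], [0, 0, 0, 5]], dtype=float)
print(predict_proba(X_new, log_prior, log_word).argmax(axis=1))
```

Keeping the labeled documents' responsibilities fixed at their one-hot values means only the unlabeled pool is relabeled on each pass, which matches the abstract's description of augmenting a small labeled set rather than clustering from scratch.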