A multi-view approach to semi-supervised document classification with incremental Naive Bayes

Authors:
Ping Gu; QingSheng Zhu; Cheng Zhang
Affiliations:
ChongQing University, Institute of Computer Science and Technology, ChongQing 400044, PR China (all authors)
Venue:
Computers & Mathematics with Applications
Year:
2009

Citing 10

Combining labeled and unlabeled data with co-training. COLT '98 Proceedings of the eleventh annual conference on Computational learning theory
Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning - Special issue on information retrieval
Analyzing the effectiveness and applicability of co-training. Proceedings of the ninth international conference on Information and knowledge management
Transductive Inference for Text Classification using Support Vector Machines. ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Enhancing Supervised Learning with Unlabeled Data. ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Selective Sampling with Redundant Views. Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Employing EM and Pool-Based Active Learning for Text Classification. ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
A simple rule-based part of speech tagger. ANLC '92 Proceedings of the third conference on Applied natural language processing
Word sense disambiguation using Conceptual Density. COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Text Classification by Boosting Weak Learners based on Terms and Concepts. ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining

Cited 2

Granulation-based symbolic representation of time series and semi-supervised classification. Computers & Mathematics with Applications
Semi-supervised text categorization: Exploiting unlabeled data using ensemble learning algorithms. Intelligent Data Analysis

Quantified Score

Hi-index	0.09

Abstract

Many semi-supervised learning algorithms consider only the distribution of word frequencies, ignoring the semantic and syntactic information underlying the documents. In this paper, we present a new multi-view approach to semi-supervised document classification that incorporates both semantic and syntactic information. For this purpose, we propose a co-training style algorithm, Co-features. In the active querying phase, we assign each sample document a weight according to its uncertainty factor. The most informative samples are then selected and labeled by other "teachers". In contrast to batch training, we develop an incremental Naive Bayes update method, which allows more efficient training even with a large pool of unlabeled data. Experimental results show that our algorithm works successfully on the Reuters-21578 and WebKB datasets and is superior to Co-testing in learning efficiency.
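The abstract does not give the update rule or the definition of the uncertainty factor, so the following is only a minimal sketch of the general idea, not the authors' Co-features algorithm. It assumes a multinomial Naive Bayes model with Laplace smoothing that stores raw counts, so that adding one newly labeled document costs time proportional to that document's length instead of requiring a full retrain. Shannon entropy of the posterior stands in for the uncertainty factor. The names `IncrementalNB`, `update`, `predict_proba`, and `uncertainty` are hypothetical, as are the toy documents.

```python
import math
from collections import defaultdict


class IncrementalNB:
    """Multinomial Naive Bayes kept as raw counts (illustrative assumption)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant (assumed)
        self.class_docs = defaultdict(int)    # number of documents seen per class
        self.class_tokens = defaultdict(int)  # number of tokens seen per class
        self.word_counts = defaultdict(dict)  # class -> {token: count}
        self.vocab = set()

    def update(self, tokens, label):
        """Fold one labeled document into the counts; no retraining pass."""
        self.class_docs[label] += 1
        counts = self.word_counts[label]
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
            self.class_tokens[label] += 1
            self.vocab.add(t)

    def predict_proba(self, tokens):
        """Return {class: posterior probability} for a tokenized document."""
        total_docs = sum(self.class_docs.values())
        v = len(self.vocab)
        log_post = {}
        for c, n_docs in self.class_docs.items():
            lp = math.log(n_docs / total_docs)
            denom = self.class_tokens[c] + self.alpha * v
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are skipped
                    lp += math.log((self.word_counts[c].get(t, 0) + self.alpha) / denom)
            log_post[c] = lp
        # Normalize in log space to avoid underflow on long documents.
        m = max(log_post.values())
        exp = {c: math.exp(lp - m) for c, lp in log_post.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}


def uncertainty(proba):
    """Shannon entropy of the posterior; a stand-in for the uncertainty factor."""
    return -sum(p * math.log(p) for p in proba.values() if p > 0)


# Toy usage: train on two labeled documents, then pick the unlabeled
# document the current model is least sure about.
nb = IncrementalNB()
nb.update(["stock", "market", "shares"], "finance")
nb.update(["match", "goal", "team"], "sport")

pool = [["stock", "shares"], ["team", "market"]]
most_uncertain = max(pool, key=lambda d: uncertainty(nb.predict_proba(d)))
```

In an active querying loop of this kind, the selected document would be handed to a labeler (another view's classifier or a human, depending on the setup) and then passed to `update`, so the model grows one document at a time.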