The nature of statistical learning theory
Combining labeled and unlabeled data with co-training
COLT '98 Proceedings of the Eleventh Annual Conference on Computational Learning Theory
An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
Text Classification from Labeled and Unlabeled Documents using EM
Machine Learning - Special issue on information retrieval
Using LSI for text classification in the presence of background text
Proceedings of the tenth international conference on Information and knowledge management
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
Using Unlabelled Data for Text Classification through Addition of Cluster Parameters
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Transductive Inference for Text Classification using Support Vector Machines
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Enhancing Supervised Learning with Unlabeled Data
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Learner's Self-Assessment: A Case Study of SVM for Information Retrieval
AI '01 Proceedings of the 14th Australian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
CBC: Clustering Based Text Classification Requiring Minimal Labeled Data
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Rule discovery from textual data based on key phrase patterns
Proceedings of the 2004 ACM symposium on Applied computing
Discovering Classification from Data of Multiple Sources
Data Mining and Knowledge Discovery
ECML '07 Proceedings of the 18th European conference on Machine Learning
Classification techniques with minimal labelling effort and application to medical reports
International Journal of Data Mining and Bioinformatics
Acquisition of a classification model for a risk search system from unbalanced textual examples
International Journal of Business Intelligence and Data Mining
Towards modeling threaded discussions using induced ontology knowledge
AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
Active learning with multiple views
Journal of Artificial Intelligence Research
An e-mail analysis method based on text mining techniques
Applied Soft Computing
Exploratory Consensus of Hierarchical Clusterings for Melanoma and Breast Cancer
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Unsupervised and supervised learning in cascade for petroleum geology
Expert Systems with Applications: An International Journal
Clustering and categorization of Brazilian Portuguese legal documents
PROPOR'12 Proceedings of the 10th international conference on Computational Processing of the Portuguese Language
Web classification of conceptual entities using co-training
Expert Systems with Applications: An International Journal
CoNet: feature generation for multi-view semi-supervised learning with partially observed views
Proceedings of the 21st ACM international conference on Information and knowledge management
High performance query expansion using adaptive co-training
Information Processing and Management: an International Journal
The impact of semi-supervised clustering on text classification
Proceedings of the 17th Panhellenic Conference on Informatics
In this paper, we present a new co-training strategy that makes use of unlabelled data. It trains two predictors in parallel, with each predictor labelling the unlabelled data for training the other predictor in the next round. Both predictors are support vector machines: one is trained on data from the original feature space, the other on new features derived by clustering both the labelled and unlabelled data. Hence, unlike standard co-training methods, our method neither requires the a priori existence of two redundant views, either of which could be used for classification, nor depends on the availability of two different supervised learning algorithms that complement each other. We evaluated our method with two classifiers and three text benchmarks: WebKB, Reuters newswire articles and 20 Newsgroups. Our evaluation shows that our co-training technique improves text classification accuracy, especially when very few labelled examples are available.
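The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the function names (`cluster_view`, `co_train`), the use of k-means distances as the second view, and all parameter values are assumptions made for the example.

```python
# Hypothetical sketch of the co-training loop described above.
# View 1: the original features; view 2: distances to clusters fitted on
# ALL (labelled + unlabelled) data, standing in for the paper's cluster features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def cluster_view(X_all, n_clusters=8, seed=0):
    """Derive the second view by clustering labelled and unlabelled data."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_all)
    return km.transform(X_all)  # (n_samples, n_clusters) cluster-distance features

def co_train(X, y, labelled, rounds=3, per_round=20):
    """Two SVMs, one per view; each labels its most confident unlabelled
    examples to grow the other predictor's training pool next round."""
    views = [X, cluster_view(X)]
    pools = [set(labelled), set(labelled)]        # labelled pool per predictor
    y = y.copy()                                  # pseudo-labels written here
    clfs = [SVC(probability=True, random_state=0) for _ in views]
    for _ in range(rounds):
        for i in (0, 1):
            idx = sorted(pools[i])
            clfs[i].fit(views[i][idx], y[idx])
            other = 1 - i                         # confident labels feed the OTHER view
            unl = [j for j in range(len(y)) if j not in pools[other]]
            if not unl:
                continue
            proba = clfs[i].predict_proba(views[i][unl])
            picks = np.argsort(proba.max(axis=1))[-per_round:]
            for p in picks:
                j = unl[p]
                y[j] = clfs[i].classes_[proba[p].argmax()]
                pools[other].add(j)
    return clfs, views
```

A real implementation would track the held-out unlabelled pool explicitly rather than overwriting `y`, and would typically pick a fixed number of confident examples per class to preserve the class distribution, as in Blum and Mitchell's original co-training scheme.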