Semi-supervised document classification with a mislabeling error model

Authors:
Anastasia Krithara;Massih R. Amini;Jean-Michel Renders;Cyril Goutte
Affiliations:
Xerox Research Centre Europe, Meylan, France;University Pierre et Marie Curie, Paris, France;Xerox Research Centre Europe, Meylan, France;National Research Council Canada, Gatineau, QC, Canada
Venue:
ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Year:
2008

References:
Combining labeled and unlabeled data with co-training. COLT '98 Proceedings of the eleventh annual conference on Computational learning theory
Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning - Special issue on information retrieval
The use of unlabeled data to improve supervised learning for text summarization. SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Unsupervised document classification using sequential information maximization. SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML '98 Proceedings of the 10th European Conference on Machine Learning
Transductive Inference for Text Classification using Support Vector Machines. ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
A semisupervised learning method to merge search engine results. ACM Transactions on Information Systems (TOIS)
Semi-supervised learning with explicit misclassification modeling. IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence

Cited by:
Using Nearest Neighbor Information to Improve Cross-Language Text Classification. MICAI '09 Proceedings of the 8th Mexican International Conference on Artificial Intelligence
Learning aspect models with partially labeled data. Pattern Recognition Letters
Large-scale hierarchical text classification without labelled data. Proceedings of the fourth ACM international conference on Web search and data mining
Towards the taxonomy-oriented categorization of yellow pages queries. ACM Transactions on Internet Technology (TOIT)
An extension of the aspect PLSA model to active and semi-supervised learning for text classification. SETN'10 Proceedings of the 6th Hellenic conference on Artificial Intelligence: theories, models and applications
A hybrid semi-supervised topic model. IScIDE'11 Proceedings of the Second Sino-foreign-interchange conference on Intelligent Science and Intelligent Data Engineering

Abstract

This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification when the training set is only partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its own labeling errors. These probabilities are then taken into account when estimating the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA introduced by [9], which is based on the use of fake labels, while retaining that extension's simplicity and its ability to solve multiclass problems. In addition, it gives valuable information about the classes that are most uncertain and most difficult to label. We perform experiments on the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach compared with two other semi-supervised algorithms applied to these text classification problems.
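
The loop described in the abstract (label the unlabeled documents, estimate the labeling-error probabilities, re-estimate the model with those probabilities taken into account) can be illustrated with a minimal sketch. This is not the authors' PLSA-based method: a multinomial naive Bayes classifier is swapped in to keep the example short, and the function names, the matrix `beta` (read here as the probability of assigning label k when the true class is h) and its update rule are all assumptions made for illustration.

```python
import numpy as np

def fit_nb(X, R, alpha=1.0):
    """Fit multinomial naive Bayes from soft class responsibilities R (docs x classes)."""
    prior = R.sum(axis=0) + alpha
    prior = prior / prior.sum()
    word = R.T @ X + alpha                      # smoothed word counts per class
    word = word / word.sum(axis=1, keepdims=True)
    return np.log(prior), np.log(word)

def posterior(X, log_prior, log_word):
    """Return P(class | document) for each row of the term-count matrix X."""
    logp = X @ log_word.T + log_prior
    logp = logp - logp.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def self_train_with_error_model(X_lab, y_lab, X_unl, n_classes, n_rounds=5):
    """Hypothetical self-training loop with an explicit mislabeling model."""
    R_lab = np.eye(n_classes)[y_lab]            # hard labels as one-hot rows
    params = fit_nb(X_lab, R_lab)
    beta = np.eye(n_classes)                    # assumed start: no labeling errors
    for _ in range(n_rounds):
        # 1. Label the unlabeled documents with the current model.
        P = posterior(X_unl, *params)
        assigned = P.argmax(axis=1)
        # 2. Estimate beta[k, h] ~ P(assigned label k | true class h)
        #    (an assumed estimator based on the model's own posteriors).
        A = np.eye(n_classes)[assigned]
        beta = (A.T @ P) / P.sum(axis=0)
        # 3. Correct the responsibilities with the error model and refit.
        Q = P * beta[assigned]
        Q = Q / Q.sum(axis=1, keepdims=True)
        params = fit_nb(np.vstack([X_lab, X_unl]), np.vstack([R_lab, Q]))
    return params, beta
```

Under these assumptions, the off-diagonal entries of the returned `beta` indicate which classes the classifier tends to confuse, which is the kind of per-class uncertainty information the abstract mentions.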