Text Classification without Labeled Negative Documents

Authors:
Gabriel Pui Cheong Fung;Jeffrey Xu Yu;Hongjun Lu;Philip S. Yu
Affiliations:
Chinese University of Hong Kong;Chinese University of Hong Kong;Hong Kong University of Science and Technology;IBM T. J. Watson Research Center
Venue:
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Year:
2005

Citing 18
Cited 8

Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
A comparison of classifiers and document representations for the routing problem

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Fast and effective text mining using linear-time document clustering

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
A re-examination of text categorization methods

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Semi-supervised support vector machines

Proceedings of the 1998 conference on Advances in neural information processing systems II
Text Classification from Labeled and Unlabeled Documents using EM

Machine Learning - Special issue on information retrieval
A study of thresholding strategies for text categorization

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Exploiting Relations Among Concepts to Acquire Weakly Labeled Training Data

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Partially Supervised Classification of Text Documents

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Combining Labeled and Unlabeled Data for MultiClass Text Categorization

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Refining Initial Points for K-Means Clustering

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
PEBL: positive example based learning for Web page classification using SVM

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Heterogeneous Learner for Web Page Classification

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Building Text Classifiers Using Positive and Unlabeled Examples

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Learning to classify texts using positive and unlabeled data

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence

Parameter free bursty events detection in text streams

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Text Classification without Negative Examples Revisit

IEEE Transactions on Knowledge and Data Engineering
Web dynamics and their ramifications for the development of web search engines

Computer Networks: The International Journal of Computer and Telecommunications Networking - Web dynamics
Building a Text Classifier by a Keyword and Unlabeled Documents

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Building a Text Classifier by a Keyword and Wikipedia Knowledge

ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Learning to identify unexpected instances in the test set

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
A cost-sensitive technique for positive-example learning supporting content-based product recommendations in B-to-C e-commerce

Decision Support Systems
Sampling the Web as Training Data for Text Classification

International Journal of Digital Library Systems


Abstract

This paper presents a new solution to the problem of building a text classifier from a small set of labeled positive documents (P) and a large set of unlabeled documents (U). The unlabeled set contains a mix of positive and negative documents; in other words, no document is labeled as negative. This makes building a reliable text classifier challenging. Existing approaches to this problem generally take two steps: i) extract negative documents (N) from U; and ii) build a classifier based on P and N. However, none of the reported studies tries to further extract positive documents (P') from U. Intuitively, extracting P' from U will increase the reliability of the classifier. However, extracting P' from U is difficult. A document in U that possesses some of the features exhibited in P is not necessarily a positive document, and vice versa. Extracting positive documents is delicate, because any wrongly extracted positive samples may become noise. The very large size of U and its very high diversity also contribute to the difficulty of extracting positive documents. In this paper, we propose a partition-based heuristic that aims to extract both positive and negative documents from U. Extensive experiments are conducted on three benchmarks. The results indicate that the proposed heuristic significantly outperforms all of the existing approaches, especially when the size of P is extremely small.
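
The generic two-step approach that the abstract attributes to existing work can be illustrated with a minimal sketch. This is not the partition-based heuristic proposed in the paper, whose details the abstract does not give. The sketch assumes bag-of-words vectors, cosine similarity, and a nearest-centroid classifier; the function names and the `fraction` parameter are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of term counts; the scale does not affect cosine similarity."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def extract_negatives(P, U, fraction=0.5):
    """Step i (assumed rule): take the documents in U least similar to P as N."""
    p_center = centroid(P)
    ranked = sorted(U, key=lambda d: cosine(d, p_center))
    return ranked[: max(1, int(len(ranked) * fraction))]


def build_classifier(P, N):
    """Step ii: nearest-centroid classifier built from P and N."""
    p_center, n_center = centroid(P), centroid(N)
    return lambda d: cosine(d, p_center) > cosine(d, n_center)


# Toy data: P is about finance; U mixes finance and sports documents.
P = [vectorize(t) for t in ["stock market shares rise",
                            "market trading shares fall"]]
U = [vectorize(t) for t in ["shares rally on stock market",
                            "football team wins match",
                            "coach praises football players",
                            "tennis match ends in upset"]]

N = extract_negatives(P, U)
is_positive = build_classifier(P, N)
print(is_positive(vectorize("stock shares climb")))
print(is_positive(vectorize("football match tonight")))
```

Note that this sketch never pulls additional positive documents out of U, which is the gap the abstract identifies in the two-step approach. The positive document hidden in U ("shares rally on stock market") is simply left unused.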