This paper presents a new solution to the problem of building a text classifier from a small set of labeled positive documents (P) and a large set of unlabeled documents (U). The unlabeled set contains both positive and negative documents; in other words, no document is labeled as negative. This makes building a reliable text classifier challenging. Existing approaches to this problem generally use two steps: i) extract negative documents (N) from U; and ii) build a classifier from P and N. However, none of the reported studies tries to extract additional positive documents (P′) from U. Intuitively, extracting P′ from U should increase the reliability of the classifier, but doing so is difficult. A document in U that shares some features with the documents in P is not necessarily positive, and a document that lacks them is not necessarily negative. Extracting positive documents is therefore delicate, because wrongly extracted samples become noise. The very large size of U and its very high diversity add to the difficulty. In this paper, we propose a partition-based heuristic that aims to extract both positive and negative documents from U. Extensive experiments are conducted on three benchmarks. The results indicate that the proposed heuristic significantly outperforms all of the existing approaches, especially when the size of P is extremely small.
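To make the generic two-step approach mentioned above concrete, here is a minimal sketch in Python. It is not the partition-based heuristic proposed in the paper, and it does not extract P′. It assumes bag-of-words vectors, cosine similarity to the centroid of P for step i, and a nearest-centroid classifier for step ii. All function names, the toy documents, and the `fraction` parameter are hypothetical choices made for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Sum of term counts; scale does not matter for cosine similarity."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def extract_reliable_negatives(P, U, fraction=0.5):
    """Step i (assumed rule): take the documents in U least similar to P as N."""
    pos_centroid = centroid(P)
    ranked = sorted(range(len(U)), key=lambda i: cosine(U[i], pos_centroid))
    k = max(1, int(len(U) * fraction))
    return [U[i] for i in ranked[:k]]

def build_classifier(P, N):
    """Step ii (assumed model): nearest-centroid classifier built from P and N."""
    pos_c, neg_c = centroid(P), centroid(N)
    def classify(text):
        v = vectorize(text)
        return cosine(v, pos_c) > cosine(v, neg_c)
    return classify

# Toy data: P holds finance documents; U mixes finance and sports documents.
P = [vectorize(d) for d in [
    "stock market shares rise",
    "market shares fall on trading",
]]
U = [vectorize(d) for d in [
    "stock trading volume rises",
    "football team wins match",
    "team scores in football final",
    "shares market rally",
]]

N = extract_reliable_negatives(P, U, fraction=0.5)
classify = build_classifier(P, N)
print(classify("market shares trading"))
print(classify("football match final"))
```

In this sketch the finance-like documents left in U are simply ignored after step i. Those leftover documents are the ones the abstract argues could be mined as additional positives, with the stated risk that a wrong extraction adds noise.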