Text classification has received growing attention owing to the enormous growth of digital content available online. This paper investigates the design of two-class text classifiers that use only positive and unlabeled data. What distinguishes this problem is that no labeled negative examples are available for learning, which makes traditional text classification techniques inapplicable. In this paper, a novel semi-supervised classification algorithm based on tolerance rough set theory and ensemble learning is proposed. Tolerance rough set theory is used to approximate the concepts present in documents and to extract an initial set of negative examples. SVM, Rocchio, and Naive Bayes algorithms then serve as base classifiers for an ensemble classifier, which runs iteratively and exploits the margins between positive and negative data to progressively improve the approximation of the negative data, so that the class boundary eventually converges to the true boundary of the positive class in the feature space. The methods are evaluated experimentally on two common text corpora, the Reuters-21578 collection and the WebKB collection. The results indicate that the proposed method achieves a significant performance improvement.
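The first stage described above, extracting an initial negative set with tolerance rough sets, can be illustrated with a minimal Python sketch. It assumes the common co-occurrence form of the tolerance rough set model: two terms tolerate each other when they appear together in at least `theta` documents, and a document's upper approximation is every term whose tolerance class overlaps the document. The selection rule in `initial_negatives` (keep unlabeled documents whose upper approximation shares no term with the positive documents) is an illustrative assumption, not necessarily the rule used in the paper. All function names and the `theta` parameter are hypothetical. The iterative ensemble stage is not covered.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to the terms that co-occur with it in >= theta documents."""
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}  # every term tolerates itself
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Terms whose tolerance class overlaps the document's term set."""
    terms = set(doc)
    return {t for t, cls in classes.items() if cls & terms}


def initial_negatives(positive, unlabeled, theta=2):
    """Indices of unlabeled documents treated as initial negatives.

    Assumed heuristic: a document qualifies when its upper approximation
    shares no term with the upper approximations of the positive documents.
    """
    classes = tolerance_classes(positive + unlabeled, theta)
    pos_terms = set()
    for doc in positive:
        pos_terms |= upper_approximation(doc, classes)
    return [
        i for i, doc in enumerate(unlabeled)
        if not (upper_approximation(doc, classes) & pos_terms)
    ]


if __name__ == "__main__":
    positive = [["svm", "kernel", "margin"], ["svm", "kernel", "text"]]
    unlabeled = [
        ["kernel", "margin", "svm"],
        ["football", "goal", "league"],
        ["football", "goal", "match"],
        ["text", "league"],
    ]
    print(initial_negatives(positive, unlabeled, theta=2))
```

In this toy run, the two football documents are selected, while the document that shares the term `text` with the positive set is left unlabeled. A stricter or looser overlap criterion would change how many initial negatives are extracted, which is the trade-off the later iterative stage is meant to refine.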