Rough set and ensemble learning based semi-supervised algorithm for text classification

  • Authors:
  • Lei Shi; Xinming Ma; Lei Xi; Qiguo Duan; Jingying Zhao

  • Affiliations:
  • College of Information and Management Science, HeNan Agricultural University, Zhengzhou 450002, China; College of Information and Management Science, HeNan Agricultural University, Zhengzhou 450002, China; College of Information and Management Science, HeNan Agricultural University, Zhengzhou 450002, China; Zhengzhou Commodity Exchange, Zhengzhou 450008, China; Department of Computer Science and Engineering, Dalian Nationalities University, Dalian 116600, China

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Abstract

Text classification has received increasing attention due to the enormous growth of digital content available online. This paper investigates the design of two-class text classifiers using only positive and unlabeled data. The distinguishing feature of this problem is that no labeled negative examples are available for learning, which makes traditional text classification techniques inapplicable. In this paper, a novel semi-supervised classification algorithm based on tolerance rough sets and ensemble learning is proposed. Tolerance rough set theory is used to approximate the concepts that exist in documents and to extract an initial set of negative examples. Then SVM, Rocchio, and Naive Bayes algorithms are used as base classifiers to construct an ensemble classifier, which runs iteratively and exploits the margins between positive and negative data to progressively improve the approximation of the negative data. As a result, the class boundary eventually converges to the true boundary of the positive class in the feature space. An experimental evaluation of different methods is carried out on two common text corpora: the Reuters-21578 collection and the WebKB collection. The experimental results indicate that the proposed method achieves a significant performance improvement.
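
The abstract does not give the details of the negative-example extraction step, so the following is only a minimal sketch of how a tolerance rough set model might be used for it. It assumes the commonly used definition in which the tolerance class of a term contains every term that co-occurs with it in at least `theta` documents, and the upper approximation of a document contains every term whose tolerance class overlaps the document. The selection rule in `initial_negatives` (an unlabeled document is taken as negative when its upper approximation shares no term with the positive documents), the function names, the threshold `theta`, and the toy documents are all assumptions made for illustration, not the authors' procedure.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to its tolerance class: the term itself plus every
    term that co-occurs with it in at least `theta` documents."""
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Return every term whose tolerance class overlaps the document."""
    terms = set(doc)
    return {t for t, cls in classes.items() if cls & terms}


def initial_negatives(positive, unlabeled, theta=2):
    """Hypothetical selection rule (an assumption, not from the paper):
    pick unlabeled documents whose upper approximation shares no term
    with the upper approximations of the positive documents."""
    classes = tolerance_classes(positive + unlabeled, theta)
    positive_concept = set()
    for doc in positive:
        positive_concept |= upper_approximation(doc, classes)
    return [
        i for i, doc in enumerate(unlabeled)
        if not (upper_approximation(doc, classes) & positive_concept)
    ]


# Toy data: each document is a list of already-tokenized terms.
positive = [["wheat", "grain", "export"], ["wheat", "grain", "harvest"]]
unlabeled = [
    ["grain", "price"],
    ["football", "match", "goal"],
    ["football", "goal", "league"],
]
print(initial_negatives(positive, unlabeled))  # → [1, 2]
```

In this toy run the first unlabeled document is kept out of the negative set because it contains "grain", which also appears in the positive documents. In the pipeline the abstract describes, a set like this would only be the starting point that the iterative ensemble of base classifiers then refines.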