Topic-specific text filtering based on multiple reducts

  • Authors:
  • Qiang Li;Jianhua Li

  • Affiliations:
  • Modern Communication Institute, Shanghai Jiaotong Univ., Shanghai, P.R. China;Modern Communication Institute, Shanghai Jiaotong Univ., Shanghai, P.R. China

  • Venue:
  • AIS-ADM 2005 Proceedings of the 2005 international conference on Autonomous Intelligent Systems: agents and Data Mining
  • Year:
  • 2005

Abstract

Feature selection is an important step in text preprocessing: a well-chosen feature subset can match the performance obtained with the full feature set while reducing learning time. To make our system fit for its application and to embed the model in a gateway for real-time text filtering, we need to select more accurate features. In this paper, we propose a new feature selection method based on rough set theory. It generates several reducts, with the distinguishing property that the reducts share no common attributes, so the selected attributes have a stronger capability to classify new objects, especially on real data sets encountered in applications. We evaluate the method on two data sets: a benchmark data set from the UCI machine learning archive and a data set captured from the Web. We classify the objects with statistical classification methods. On the benchmark test set, we obtain good precision with a single reduct, whereas on the real data set we obtain good precision with several reducts; the latter data set is used in our system for topic-specific text filtering. We therefore conclude that our method is effective in application. We also conclude that, with the same selected features on an imbalanced data set, the SVM and VSM methods perform better, while the Naïve Bayes method performs poorly.
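The abstract does not say how the mutually disjoint reducts are computed, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes a small decision table with discrete attribute values and uses a greedy search (forward selection followed by backward pruning) as a stand-in for whatever reduct computation the paper uses. The function names `positive_region_size`, `find_reduct`, and `disjoint_reducts`, and the toy data, are hypothetical.

```python
from collections import defaultdict


def positive_region_size(rows, labels, attrs):
    """Count objects whose equivalence class on `attrs` has one decision label."""
    classes = defaultdict(list)
    for row, label in zip(rows, labels):
        classes[tuple(row[a] for a in attrs)].append(label)
    return sum(len(ls) for ls in classes.values() if len(set(ls)) == 1)


def find_reduct(rows, labels, pool, target):
    """Greedy reduct search restricted to the attribute indices in `pool`.

    Assumes `pool` as a whole already reaches `target`, the positive-region
    size of the full attribute set.
    """
    chosen = []
    # Forward selection: add the attribute that enlarges the positive region most.
    while positive_region_size(rows, labels, chosen) < target:
        best = max(
            (a for a in pool if a not in chosen),
            key=lambda a: positive_region_size(rows, labels, chosen + [a]),
        )
        chosen.append(best)
    # Backward pruning: drop attributes that are not needed to reach the target.
    for a in list(chosen):
        rest = [b for b in chosen if b != a]
        if positive_region_size(rows, labels, rest) == target:
            chosen = rest
    return sorted(chosen)


def disjoint_reducts(rows, labels):
    """Repeatedly extract a reduct, then remove its attributes from the pool."""
    pool = list(range(len(rows[0])))
    target = positive_region_size(rows, labels, pool)
    reducts = []
    while pool and positive_region_size(rows, labels, pool) == target:
        reduct = find_reduct(rows, labels, pool, target)
        if not reduct:  # degenerate table: the decision needs no attributes
            break
        reducts.append(reduct)
        pool = [a for a in pool if a not in reduct]
    return reducts


# Toy decision table (hypothetical): attribute 2 alone determines the label,
# attributes 0 and 1 determine it jointly, and attribute 3 is uninformative.
rows = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 0), (1, 1, 0, 0)]
labels = [0, 1, 1, 0]
print(disjoint_reducts(rows, labels))
```

Each returned reduct could then serve as a separate feature subset for a classifier. How the per-reduct predictions are combined is not stated in the abstract and is left open here.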