Web Document Classification Based on Rough Set

  • Authors:
  • Qiguo Duan;Duoqian Miao;Min Chen

  • Affiliations:
  • Department of Computer Science and Technology, Tongji University, Shanghai 201804, China, The Key Laboratory of "Embedded System and Service Computing", Ministry of Education, Shanghai 201804, Chi ... (same affiliation listed for all three authors)

  • Venue:
  • RSFDGrC '07 Proceedings of the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing
  • Year:
  • 2009

Abstract

In the traditional Vector Space Model representation of Web documents, the similarity between two document vectors is frequently zero, which degrades classification quality when relations between Web documents are defined. This paper proposes a novel Web document representation and classification approach based on rough sets. First, the TF*IDF weighting scheme is used to assign weights to each Web document's vector. The weights of terms that do not occur in a Web document are treated as missing information. Rough set theory for incomplete information is then introduced to fill in the missing weights and expand the Web document representation. By generating tolerance classes in both the term space and the Web document space, the missing information of a Web document can be complemented with the corresponding weights of terms in its tolerance classes, which extends the document with essential information. Finally, a Web document classification algorithm is implemented. Experimental results show that classification performance is greatly improved.
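The idea of complementing missing term weights through tolerance classes can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives no formulas, so the smoothed IDF, the co-occurrence threshold `theta`, the `damping` factor, and the rule for filling in weights are all assumptions chosen for illustration, and only tolerance classes in the term space are shown (the document-space classes are omitted).

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF is an assumption)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: tf * math.log(1 + n / df[t]) for t, tf in Counter(d).items()}
        for d in docs
    ]


def term_tolerance_classes(docs, theta=1):
    """Assumed tolerance relation: two terms are tolerant if they
    co-occur in at least `theta` documents. Each term is in its own class."""
    cooc = defaultdict(Counter)
    for d in docs:
        terms = set(d)
        for a in terms:
            for b in terms:
                if a != b:
                    cooc[a][b] += 1
    vocab = {t for d in docs for t in d}
    return {t: {t} | {u for u, c in cooc[t].items() if c >= theta} for t in vocab}


def expand(weights, classes, damping=0.5):
    """Fill in absent terms that are tolerant to a present term.
    Hypothetical rule: damped mean of the weights of the supporting terms."""
    support = defaultdict(list)
    for t, w in weights.items():
        for u in classes.get(t, ()):
            if u not in weights:
                support[u].append(w)
    expanded = dict(weights)
    for u, ws in support.items():
        expanded[u] = damping * sum(ws) / len(ws)
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Toy corpus (made up for illustration); documents 0 and 2 share no terms.
docs = [
    ["rough", "set", "theory"],
    ["rough", "set", "granular"],
    ["granular", "computing"],
    ["theory", "logic"],
]
weights = tfidf(docs)
classes = term_tolerance_classes(docs, theta=1)
expanded = [expand(w, classes) for w in weights]

before = cosine(weights[0], weights[2])
after = cosine(expanded[0], expanded[2])
print("similarity before expansion:", before)
print("similarity after expansion:", after)
```

In this toy corpus, documents 0 and 2 have no term in common, so their plain TF*IDF vectors have zero similarity. After expansion both vectors carry weights for related terms, and the similarity becomes positive. This is the zero-valued similarity problem the abstract describes, under the stated assumptions.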