Web Document Classification Based on Rough Set

Authors:
Qiguo Duan;Duoqian Miao;Min Chen
Affiliations:
Department of Computer Science and Technology, Tongji University, Shanghai 201804, China, The Key Laboratory of "Embedded System and Service Computing", Ministry of Education, Shanghai 201804, Chi ...;Department of Computer Science and Technology, Tongji University, Shanghai 201804, China, The Key Laboratory of "Embedded System and Service Computing", Ministry of Education, Shanghai 201804, Chi ...;Department of Computer Science and Technology, Tongji University, Shanghai 201804, China, The Key Laboratory of "Embedded System and Service Computing", Ministry of Education, Shanghai 201804, Chi ...
Venue:
RSFDGrC '07 Proceedings of the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing
Year:
2009

Abstract

In the traditional Vector Space Model representation of Web documents, two vectors that share no terms have zero similarity. This zero-valued similarity problem occurs frequently and degrades classification quality when relations between Web documents are defined. This paper proposes a novel Web document representation and classification approach based on rough sets. First, the TF*IDF weighting scheme is used to assign weight values to each Web document's vector. The weights of terms that do not occur in a Web document are treated as missing information. Rough set theory for incomplete information is then introduced to fill in the missing weights and expand the Web document representation. Tolerance classes are generated in both the term space and the Web document space, and the missing information of a Web document is complemented by incorporating the corresponding weights of terms in its tolerance classes, which extends the essential information available for the document. Finally, a Web document classification algorithm is implemented. Experimental results show that classification performance is greatly improved.
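The idea of complementing missing weights through term tolerance classes can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the tolerance relation or the fill-in formula, so the sketch assumes a co-occurrence threshold `theta` for the term tolerance relation, a smoothed IDF, and a simple heuristic weight for filled-in terms. It covers only the term space, not the document space, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf(docs):
    """Return one sparse TF*IDF vector (dict) per document, plus the IDF table.

    Assumption: a smoothed IDF, log(1 + N/df), so every weight is positive.
    """
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(1 + n / df[t]) for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]
    return vectors, idf


def tolerance_classes(docs, theta):
    """Assumed tolerance relation: two terms are tolerant if they co-occur
    in at least `theta` documents. Every term is tolerant with itself."""
    co = Counter()
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            co[(a, b)] += 1
    classes = {t: {t} for d in docs for t in d}
    for (a, b), count in co.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def enrich(vec, classes, idf):
    """Fill in 'missing' weights: a term absent from the document receives a
    small weight if its tolerance class overlaps the document's terms.

    Assumption: the filled-in weight is the document's smallest existing
    weight, damped by idf / (1 + idf). This is a heuristic for illustration.
    """
    present = set(vec)
    floor = min(vec.values())
    out = dict(vec)
    for term, cls in classes.items():
        if term not in present and cls & present:
            out[term] = floor * idf[term] / (1 + idf[term])
    return out


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy corpus (made up for illustration): the last two documents share no terms.
docs = [
    ["rough", "set", "approximation"],
    ["rough", "set", "tolerance"],
    ["rough", "approximation"],
    ["set", "tolerance"],
]
vectors, idf = tfidf(docs)
classes = tolerance_classes(docs, theta=2)

before = cosine(vectors[2], vectors[3])
after = cosine(enrich(vectors[2], classes, idf),
               enrich(vectors[3], classes, idf))
print(before, after)
```

In this toy corpus the last two documents have no term in common, so their plain TF*IDF similarity is zero. After enrichment each one gains a small weight for a term that is tolerant with one of its own terms ("set" for the third document, "rough" for the fourth), and the similarity becomes positive.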