In the traditional Vector Space Model representation of Web documents, the similarity between two document vectors is frequently zero, which degrades classification quality when relations between Web documents are defined. This paper proposes a novel Web document representation and classification approach based on rough sets. First, the TF*IDF weighting scheme assigns a weight to each term in a Web document's vector. The weights of terms that do not occur in a Web document are treated as missing information. Then, rough set theory for incomplete information is introduced to fill in the missing weights and expand the Web document representation. By generating tolerance classes in both the term space and the Web document space, the missing information of a Web document is complemented with the corresponding weights of terms in its tolerance classes, which extends the essential information available for that document. Finally, a Web document classification algorithm is implemented. Experimental results show that classification performance is greatly improved.
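
The abstract does not give the exact formulas, so the following Python sketch only illustrates the general idea under stated assumptions. It is not the authors' algorithm. It assumes that two terms are tolerant when they co-occur in at least `theta` documents, that a missing weight is filled with a damped average (`alpha`) of the weights of tolerant terms present in the document, and that IDF is smoothed as log(1 + N/df). It covers tolerance classes in the term space only. All function names and parameters are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Assumption: smoothed IDF, log(1 + N/df), so no weight is exactly zero.
    return [{t: tf * math.log(1 + n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]


def term_tolerance_classes(docs, theta=1):
    """Assumption: terms are tolerant if they co-occur in >= theta documents."""
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    classes = {}
    for t in vocab:
        classes[t] = {
            u for u in vocab
            if u == t or sum(1 for s in doc_sets if t in s and u in s) >= theta
        }
    return classes


def enrich(weights, classes, alpha=0.5):
    """Fill missing term weights from tolerant terms present in the document."""
    enriched = []
    for w in weights:
        e = dict(w)
        for t, cls in classes.items():
            if t in w:
                continue  # weight is known, nothing to complement
            related = [w[u] for u in cls if u in w]
            if related:
                # Assumption: damped mean of the related weights.
                e[t] = alpha * sum(related) / len(related)
        enriched.append(e)
    return enriched


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Documents 0 and 1 share no terms; document 2 links their vocabularies.
docs = [
    ["rough", "set"],
    ["tolerance", "class"],
    ["rough", "set", "tolerance", "class"],
]
weights = tfidf(docs)
enriched = enrich(weights, term_tolerance_classes(docs, theta=1))

print(cosine(weights[0], weights[1]))    # zero-valued similarity problem
print(cosine(enriched[0], enriched[1]))  # non-zero after enrichment
```

In this toy corpus, the first two documents have no term in common, so their plain TF*IDF vectors are orthogonal. After enrichment, each one receives damped weights for the other's terms through the co-occurrences in the third document, and the similarity becomes positive.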