A general method of mining Chinese web documents based on GA&SA and position-factors

Authors:
Xi Bai;Jigui Sun;Haiyan Che;Jin Wang
Affiliations:
College of Computer Science and Technology, Jilin University, Changchun, China and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Chan ...;College of Computer Science and Technology, Jilin University, Changchun, China and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Chan ...;College of Computer Science and Technology, Jilin University, Changchun, China and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Chan ...;Institute of Network and Information Security, Shandong University, Jinan, China
Venue:
PAKDD'07 Proceedings of the 2007 international conference on Emerging technologies in knowledge discovery and data mining
Year:
2007

Abstract

Clustering and classification are two important techniques for mining Web information. In this paper, we propose a new adaptive method for mining Chinese documents from the Internet. First, we give a document clustering algorithm that combines a Genetic Algorithm (GA) with Simulated Annealing (SA) on top of the Boolean model. This algorithm avoids the main drawback of clustering documents with pure GA, which converges prematurely, becomes trapped in a local optimum, and therefore cannot cluster accurately. Then, because classification with the traditional Vector Space Model (VSM) is not satisfactory, since VSM does not take the relative importance of words into account, we add the position-factors of keywords to VSM and build a new classifier model for Chinese Web documents. Experimental results indicate that this adaptive method makes clustering and classification more accurate and reasonable than methods that do not consider the positions of words.
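
The idea of adding position-factors to VSM can be illustrated with a minimal Python sketch. The abstract does not give the actual factors or the weighting formula, so everything below is an assumption for illustration only: the position names, the values in `POSITION_FACTORS`, the helper names `weighted_vector` and `cosine`, and the choice of cosine similarity for comparing vectors. Documents are assumed to be already segmented into words.

```python
import math
from collections import Counter

# Hypothetical position-factors; the abstract does not state real values.
POSITION_FACTORS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_vector(doc):
    """Build a term vector in which each occurrence of a term counts
    as the factor of the position it appears in.

    `doc` maps a position name to a list of already-segmented terms.
    Unknown positions fall back to a factor of 1.0.
    """
    vec = Counter()
    for position, terms in doc.items():
        factor = POSITION_FACTORS.get(position, 1.0)
        for term in terms:
            vec[term] += factor
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A word in the title contributes more than the same word in the body.
doc = {"title": ["足球"], "body": ["比赛", "足球", "新闻"]}
vec = weighted_vector(doc)
```

With these assumed factors, a term that appears in the title outweighs a term that appears only in the body, which is the effect the abstract attributes to position-factors. A classifier could then compare `weighted_vector(doc)` against per-class vectors with `cosine`.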