A general method of mining Chinese web documents based on GA&SA and position-factors

  • Authors:
  • Xi Bai;Jigui Sun;Haiyan Che;Jin Wang

  • Affiliations:
  • College of Computer Science and Technology, Jilin University, Changchun, China and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Chan ...;College of Computer Science and Technology, Jilin University, Changchun, China and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Chan ...;College of Computer Science and Technology, Jilin University, Changchun, China and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Chan ...;Institute of Network and Information Security, Shandong University, Jinan, China

  • Venue:
  • PAKDD'07 Proceedings of the 2007 international conference on Emerging technologies in knowledge discovery and data mining
  • Year:
  • 2007

Abstract

Clustering and classification are two important techniques for mining Web information. In this paper, a new adaptive method for mining Chinese documents from the Internet is proposed. First, we give a document clustering algorithm, based on the Boolean Model, that combines a Genetic Algorithm (GA) with Simulated Annealing (SA). This algorithm avoids the main disadvantage of clustering documents with pure GA, which gives inaccurate results because GA tends to converge prematurely and become trapped in a local optimum. Second, classification with the traditional Vector Space Model (VSM) is not satisfactory because VSM does not reflect how important each word is. We therefore add the position-factors of keywords to VSM and build a new classifier model for Chinese Web documents. Experimental results indicate that this adaptive method makes clustering and classification more accurate and reasonable than methods that do not consider the positions of words.
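The abstract does not give the classifier's formulas, so the following is only a minimal sketch of how position-factors might be folded into VSM term weights. The factor values in `POSITION_FACTORS`, the position labels, the function names, and the centroid-based cosine classifier are all assumptions made for illustration; they are not taken from the paper. Documents are assumed to be already segmented into (term, position) pairs.

```python
import math
from collections import defaultdict

# Hypothetical position-factors: the abstract does not state actual values.
POSITION_FACTORS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_vector(doc, factors=POSITION_FACTORS):
    """Build a term-weight vector from (term, position) pairs.

    Each occurrence adds the factor of the position it appears in,
    so a title occurrence counts more than a body occurrence.
    Unknown positions fall back to a factor of 1.0.
    """
    vec = defaultdict(float)
    for term, position in doc:
        vec[term] += factors.get(position, 1.0)
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc, centroids, factors=POSITION_FACTORS):
    """Return the label whose centroid is most similar to the document."""
    vec = weighted_vector(doc, factors)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy example: "stock" appears once in the title, "match" twice in the body.
doc = [("stock", "title"), ("match", "body"), ("match", "body")]
centroids = {"sports": {"match": 1.0}, "finance": {"stock": 1.0}}

with_positions = classify(doc, centroids)
# Setting every factor to 1.0 reduces the weights to plain term counts.
flat = {"title": 1.0, "heading": 1.0, "body": 1.0}
without_positions = classify(doc, centroids, flat)
```

In this toy case the title occurrence of "stock" outweighs the two body occurrences of "match", so the position-weighted vector is assigned to "finance", while plain term counts assign it to "sports".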