Web text clustering with dynamic themes

  • Authors:
  • Ping Ju Hung; Ping Yu Hsu; Ming Shien Cheng; Chih Hao Wen

  • Affiliations:
  • National Central University, Department of Business Administration, Jhongli City, Taoyuan County, Taiwan, ROC; National Central University, Department of Business Administration, Jhongli City, Taoyuan County, Taiwan, ROC; Ming Chi University of Technology, Department of Industrial Engineering and Management, New Taipei City, Taiwan, ROC; National Central University, Department of Business Administration, Jhongli City, Taoyuan County, Taiwan, ROC

  • Venue:
  • WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
  • Year:
  • 2011

Abstract

Data mining research has produced many techniques for filtering useful information out of large volumes of data, and document clustering is one of the most important. There are two approaches to document clustering: clustering on document metadata and clustering on document content. Most previous content-based approaches relied on document summaries (of a single file or of multiple files) and on word-vector analysis, extracting a small number of important keywords on which to cluster. In this study, we categorize popular ("hot") commodities on the web and name the resulting categories, using the web text (abstracts) that describes these commodities together with their access counts. First, we parse the Chinese web text of the commodity documents and apply a hierarchical agglomerative clustering algorithm, Ward's method, to group words into themes by their properties and to decide the number of themes. Second, we adopt the Cross-Collection Mixture Model, which has been applied in temporal text mining, and combine it with the access counts (the degree to which users identify with the words) to collect dynamic themes; we then use the resulting probability distributions to gather stable words that serve as the vectors for document clustering. Third, we estimate the model parameters with the Expectation-Maximization (EM) algorithm. Finally, we apply K-means, using the extracted dynamic themes as the features for document clustering. The study thus proposes a novel approach to document clustering, and a series of experiments shows that the algorithm is effective and can improve the accuracy of the clustering results.
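
The final step, K-means over theme-based document features, can be illustrated with a minimal sketch. It is not the authors' implementation. The theme word lists, the English tokens (standing in for segmented Chinese text), the feature definition (share of a document's tokens that fall in each theme), and the deterministic initialization are all assumptions made for illustration. In the paper, the themes come from Ward's method and the Cross-Collection Mixture Model fitted with EM, neither of which is reproduced here.

```python
def theme_features(tokens, themes):
    """Share of a document's tokens that belong to each theme's word list.

    This feature definition is an assumption of the sketch, not the paper's.
    """
    n = max(len(tokens), 1)
    return [sum(tok in words for tok in tokens) / n for words in themes]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain K-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical themes and tokenized documents, for illustration only.
themes = [{"phone", "screen", "battery"}, {"shoe", "running", "sole"}]
docs = [
    ["phone", "screen", "sale"],
    ["shoe", "running", "sale"],
    ["battery", "phone", "phone"],
    ["sole", "shoe"],
]

points = [theme_features(d, themes) for d in docs]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1]
```

Representing each document by a handful of theme weights rather than a full word vector keeps the feature space small, which is the motivation the abstract gives for clustering on extracted themes.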