Data mining research has produced many techniques for filtering useful information out of large volumes of data, and document clustering is one of the most important. There are two approaches to document clustering: clustering on document metadata and clustering on document content. Most previous content-based approaches focus on document summaries (of a single file or of multiple files) and on word-vector analysis, identifying a small number of important keywords with which to cluster the documents. In this study, we categorize popular products ("hot commodities") on the web and then name the categories, based on the web text (abstracts) describing these products and on their access counts. First, we parse the Chinese web text of the product documents and apply a hierarchical agglomerative clustering algorithm, Ward's method, to group words into themes by their properties and to determine the number of themes. Second, we adopt the Cross-Collection Mixture Model, which has been applied in Temporal Text Mining, together with the access counts (the degree of user identification with words), to collect dynamic themes; we then gather stable words according to their probability distributions to serve as the vectors for document clustering. Third, we estimate the parameters with the Expectation-Maximization (EM) algorithm. Finally, we apply K-means with the extracted dynamic themes as the features for document clustering. This study proposes a novel approach to document clustering, and a series of experiments shows that the algorithm is effective and can improve the accuracy of clustering results.
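The final step, K-means over theme-based document features, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the `doc_theme` weights are invented for illustration, the function name `kmeans` is hypothetical, and the farthest-first initialization is an assumption chosen only to make the example deterministic. In the approach described above, the feature vectors would come from the themes extracted by the mixture model and EM.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, iters=100):
    """Lloyd's K-means with a deterministic farthest-first initialization."""
    # Assumed initialization: start from the first point, then repeatedly
    # add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(iters):
        # Assignment step: each document goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

# Hypothetical per-document theme weights (one row per document,
# one column per theme); these numbers are made up for the example.
doc_theme = [
    (0.90, 0.05, 0.05),
    (0.80, 0.10, 0.10),
    (0.10, 0.85, 0.05),
    (0.05, 0.90, 0.05),
    (0.10, 0.10, 0.80),
    (0.05, 0.15, 0.80),
]
labels, centers = kmeans(doc_theme, k=3)
```

On this toy input the six documents fall into three clusters of two, one per dominant theme. In practice the number of clusters would be tied to the number of themes found in the first step, and a library implementation would normally replace this hand-written loop.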