Multi-grain hierarchical topic extraction algorithm for text mining

  • Authors:
  • Jianping Zeng;Chengrong Wu;Wei Wang

  • Affiliations:
  • School of Computer Science, Fudan University, Shanghai 200433, PR China;School of Computer Science, Fudan University, Shanghai 200433, PR China;School of Computer Science, Fudan University, Shanghai 200433, PR China

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2010

Quantified Score

Hi-index 12.05

Visualization

Abstract

Topic extraction from text corpus is the fundamental of many topic analysis tasks, such as topic trend prediction, opinion extraction. Since hierarchical structure is characteristics of topics, it is preferential for a topic extraction algorithm to output the topics description with this kind of structure. However, the hierarchical topic structure that is extracted by most of the current topic analysis algorithms cannot provide a meaningful description for all subtopics in the hierarchical tree. Here, we propose a new hierarchical topic extraction algorithm based on topic grain computation. By considering the distribution of word document frequency as a mixture Gaussian, an EM-like algorithm is employed to achieve the best number of mixture components, and the mean value of each component. Then topic grain is defined based on the mixture Gaussian parameters, and feature words are selected for the grain. A clustering algorithm is employed to the converted text set based on the feature words. After repeatedly applying the clustering algorithm to different converted text set, a multi-grain hierarchical topic structure with different subtopic feature words description is extracted. Experiments on two real world datasets which are collected from a news website show that the proposed algorithm can generate more meaningful multi-grain topic structure, by comparing with the current hierarchical topic clustering algorithms.