Multi-grain hierarchical topic extraction algorithm for text mining

Authors:
Jianping Zeng; Chengrong Wu; Wei Wang
Affiliations:
School of Computer Science, Fudan University, Shanghai 200433, PR China (all three authors)
Venue:
Expert Systems with Applications: An International Journal
Year:
2010

Abstract

Topic extraction from a text corpus is fundamental to many topic analysis tasks, such as topic trend prediction and opinion extraction. Because topics are inherently hierarchical, a topic extraction algorithm should preferably output topic descriptions with that structure. However, the hierarchical topic structures extracted by most current topic analysis algorithms do not provide a meaningful description for every subtopic in the hierarchical tree. Here, we propose a new hierarchical topic extraction algorithm based on topic grain computation. The distribution of word document frequency is modeled as a Gaussian mixture, and an EM-like algorithm is employed to determine the best number of mixture components and the mean value of each component. A topic grain is then defined from the Gaussian mixture parameters, and feature words are selected for each grain. A clustering algorithm is applied to the text set converted according to the feature words. By repeatedly applying the clustering algorithm to the different converted text sets, a multi-grain hierarchical topic structure is extracted in which each subtopic is described by its own feature words. Experiments on two real-world datasets collected from a news website show that the proposed algorithm generates a more meaningful multi-grain topic structure than current hierarchical topic clustering algorithms.
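
The grain computation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only states that an EM-like algorithm finds the number of Gaussian mixture components and their means, so the sketch assumes plain EM on raw document frequencies, model selection by BIC, a variance floor, and a hard assignment of each word to its most likely component as that grain's feature words. The function names and the toy vocabulary with its counts are hypothetical.

```python
import math

def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def e_step(xs, weights, means, variances):
    """Return per-point responsibilities and the total log-likelihood."""
    resp, loglik = [], 0.0
    for x in xs:
        logs = [math.log(w) + log_normal(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)  # log-sum-exp trick for numerical stability
        norm = top + math.log(sum(math.exp(l - top) for l in logs))
        loglik += norm
        resp.append([math.exp(l - norm) for l in logs])
    return resp, loglik

def fit_gmm_1d(xs, k, iters=200, var_floor=0.25):
    """Fit a k-component 1-D Gaussian mixture with plain EM (assumed variant)."""
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, var_floor)
    # Deterministic initialisation: means at evenly spaced quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        resp = e_step(xs, weights, means, variances)[0]
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)  # floor avoids collapse
    loglik = e_step(xs, weights, means, variances)[1]
    return weights, means, variances, loglik

def bic(loglik, k, n):
    """Bayesian information criterion; a 1-D mixture has 3k - 1 free parameters."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)

def select_mixture(xs, max_k=4):
    """Pick the component count with the lowest BIC (an assumed criterion)."""
    best = None
    for k in range(1, max_k + 1):
        weights, means, variances, loglik = fit_gmm_1d(xs, k)
        score = bic(loglik, k, len(xs))
        if best is None or score < best[0]:
            best = (score, k, weights, means, variances)
    return best

def words_by_grain(doc_freq, weights, means, variances):
    """Group words by the component that best explains their document frequency."""
    words = list(doc_freq)
    resp = e_step([doc_freq[w] for w in words], weights, means, variances)[0]
    grains = {j: [] for j in range(len(means))}
    for word, r in zip(words, resp):
        grains[r.index(max(r))].append(word)
    return grains

# Hypothetical toy vocabulary: word -> number of documents containing it.
doc_freq = {
    "government": 47, "market": 48, "policy": 49,
    "economy": 50, "company": 51, "report": 52,
    "tariff": 3, "merger": 4, "referendum": 5,
    "subsidy": 5, "turnout": 6, "dividend": 7,
}

score, best_k, weights, means, variances = select_mixture(list(doc_freq.values()))
grains = words_by_grain(doc_freq, weights, means, variances)
for j, words in grains.items():
    print(f"grain {j} (mean df {means[j]:.1f}): {sorted(words)}")
```

In this reading, each grain's feature words would define one converted text set, and a clustering algorithm would then be run on each set in turn to build the levels of the hierarchy. The clustering and conversion steps are omitted here because the abstract does not specify them.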