Multinomial mixture model with feature selection for text clustering

  • Authors: Minqiang Li; Liang Zhang

  • Affiliation (both authors): School of Management, Tianjin University, 92 Weijin Road, Nankai District, Tianjin 300072, China

  • Venue: Knowledge-Based Systems

  • Year: 2008

Abstract

Selecting relevant features is a hard problem in unsupervised text clustering because there are no class labels to guide the search. This paper proposes a new mixture-model method for unsupervised text clustering, named the multinomial mixture model with feature selection (M3FS). In M3FS, we introduce the concept of component-dependent "feature saliency" to the mixture model. A feature is considered relevant to a given mixture component if its feature saliency value for that component is higher than a predefined threshold. Feature selection is thus treated as a parameter estimation problem, and the Expectation-Maximization (EM) algorithm is used to estimate the model. Experimental results on commonly used text datasets show that the M3FS method has good clustering performance and feature selection capability.
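
For orientation, the EM procedure that M3FS builds on can be sketched for a plain multinomial mixture over word-count vectors. This is a minimal illustration only: the abstract does not give the M3FS update equations, so the feature saliency parameters and the threshold rule are not reproduced here. The function name em_multinomial_mixture, its parameters, and the additive smoothing constant are assumptions made for this sketch, not part of the paper.

```python
import math
import random


def em_multinomial_mixture(counts, n_components, n_iter=50, smoothing=1e-2, seed=0):
    """Fit a plain multinomial mixture to word-count vectors with EM.

    Illustrative baseline only: it has no feature saliency parameters.
    counts: list of documents, each a list of non-negative word counts.
    Returns (weights, word_probs, responsibilities).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    comps = range(n_components)

    # Start from random soft assignments; each row sums to 1.
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in comps]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: smoothed mixing weights (smoothing is an assumption of
        # this sketch; it keeps every weight and probability above zero).
        denom = n_docs + n_components * smoothing
        weights = [
            (smoothing + sum(resp[d][k] for d in range(n_docs))) / denom
            for k in comps
        ]
        # M-step: smoothed word probabilities for each component.
        word_probs = []
        for k in comps:
            expected = [
                smoothing + sum(resp[d][k] * counts[d][w] for d in range(n_docs))
                for w in range(n_words)
            ]
            total = sum(expected)
            word_probs.append([e / total for e in expected])

        # E-step: posterior component probabilities, computed in log space.
        # The multinomial coefficient is the same for every component and
        # cancels in the normalization, so it is omitted.
        for d in range(n_docs):
            log_post = [
                math.log(weights[k])
                + sum(c * math.log(word_probs[k][w]) for w, c in enumerate(counts[d]))
                for k in comps
            ]
            peak = max(log_post)
            unnorm = [math.exp(lp - peak) for lp in log_post]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]

    return weights, word_probs, resp
```

Calling em_multinomial_mixture(counts, 2) on a small document-term count matrix returns the mixing weights, the per-component word distributions, and a soft cluster assignment for each document. A hard clustering can be read off by taking the most probable component per document.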