Semi-supervised PLSA for Document Clustering

  • Authors:
  • Lingfeng Niu; Yong Shi

  • Venue:
  • ICDMW '10 Proceedings of the 2010 IEEE International Conference on Data Mining Workshops
  • Year:
  • 2010

Abstract

By exploiting must-link or cannot-link pairwise constraints in the data, semi-supervised clustering significantly improves on the performance of unsupervised clustering. A number of semi-supervised clustering algorithms have been proposed to incorporate such pairwise constraints. However, most of them assign a hard label to each data item and produce little information about the clusters themselves. In this work, we propose a semi-supervised algorithm for document clustering based on Probabilistic Latent Semantic Analysis (PLSA), which employs must-link supervision between pairs of documents, a form of supervision that is available in many real-world data sets. The new algorithm produces a soft cluster label assignment for each document as well as a probabilistic representation of the latent topics in each cluster. No parameters need to be estimated beyond those of standard PLSA, which reduces the risk of over-fitting, especially when the data is sparse. We provide an Expectation Maximization (EM) procedure for semi-supervised PLSA that determines locally optimal parameters maximizing the likelihood. To exploit multiple computation nodes for large-scale data, we also propose a distributed implementation of the EM procedure based on the MapReduce framework. Experimental results on public data validate the effectiveness and efficiency of the new method.
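
The abstract does not spell out the update equations, so the following is only a minimal sketch of how an EM procedure for PLSA with must-link supervision might look. It rests on one assumption that is not confirmed by the abstract: must-linked documents are made to share a single topic mixture P(z|d), which is one way to use the supervision without adding parameters beyond those of standard PLSA. The function name `plsa_em`, its arguments, and the toy corpus are all hypothetical.

```python
import math
import random


def plsa_em(counts, n_topics, must_links=(), n_iter=50, seed=0):
    """Sketch of PLSA fitted by EM on a document-term count matrix.

    counts[d][w] is the count of word w in document d.
    must_links is a list of (i, j) document index pairs.
    ASSUMPTION (not taken from the paper): must-linked documents are
    tied to one shared topic mixture P(z|d).
    Returns (p_z_d, p_w_z, log_likelihood_history).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Union-find: merge must-linked documents into groups.
    parent = list(range(n_docs))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in must_links:
        parent[find(i)] = find(j)
    group = [find(d) for d in range(n_docs)]

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Random positive initialization of P(w|z) and of one mixture per group.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    mix = {g: normalize([rng.random() + 0.1 for _ in range(n_topics)])
           for g in set(group)}

    history = []
    for _ in range(n_iter):
        # Accumulators for expected counts; the tiny constant avoids 0/0.
        new_w = [[1e-12] * n_words for _ in range(n_topics)]
        new_mix = {g: [1e-12] * n_topics for g in mix}
        log_lik = 0.0
        for d in range(n_docs):
            theta = mix[group[d]]
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z|d,w) is joint[z] / total.
                joint = [theta[z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                log_lik += n * math.log(total)
                for z in range(n_topics):
                    r = n * joint[z] / total
                    new_w[z][w] += r
                    new_mix[group[d]][z] += r
        # M-step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in new_w]
        mix = {g: normalize(row) for g, row in new_mix.items()}
        history.append(log_lik)

    p_z_d = [mix[group[d]] for d in range(n_docs)]
    return p_z_d, p_w_z, history


# Toy corpus: 5 documents over a 4-word vocabulary (made-up counts).
docs = [
    [6, 5, 0, 0],
    [5, 6, 1, 0],
    [3, 2, 2, 1],
    [0, 0, 6, 5],
    [0, 1, 5, 6],
]
p_z_d, p_w_z, history = plsa_em(docs, n_topics=2, must_links=[(0, 2)])
for d, row in enumerate(p_z_d):
    print(d, [round(p, 3) for p in row])
```

Each row of `p_z_d` is a soft cluster assignment for one document, and each row of `p_w_z` is a word distribution describing one latent topic. In this sketch the E-step accumulators are plain sums over documents, which is the kind of structure that lends itself to a map step over documents followed by a reduce step over the partial sums; the paper's actual MapReduce design is not described in the abstract.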