A unified framework for document clustering with dual supervision

  • Authors:
  • Yeming Hu;Evangelos E. Milios;James Blustein

  • Affiliations:
  • Dalhousie University, Halifax, Canada;Dalhousie University, Halifax, Canada;Dalhousie University

  • Venue:
  • ACM SIGAPP Applied Computing Review
  • Year:
  • 2012

Abstract

Semi-supervised clustering algorithms for general problems use a small number of labeled instances or pairwise instance constraints to aid unsupervised clustering. For document clustering, however, user supervision can also be provided in alternative forms, such as labeling a feature by associating it with a document or a cluster. In this paper, we present a unified framework in which both labeled documents and labeled features are used to seed clusters, and this information is refined using intermediate clusters. We introduce two methods of using labeled features to generate cluster seeds. Experimental results on several real-world data sets demonstrate that seeding the clustering with both documents and features can significantly improve document clustering performance over random seeding and document-only seeding. We also demonstrate that, compared to unsupervised clustering, performance can be improved even when only a fraction of the clusters are seeded.
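
The abstract does not spell out the two feature-seeding methods, so the following is only a minimal sketch of the general idea of dual-supervision seeding, not the authors' algorithm. It assumes documents are term-weight vectors, that a labeled feature contributes to a seed as a pseudo-document with unit weight on that feature, and that unseeded clusters fall back to a random document. The names `build_seeds` and `seeded_kmeans` and the cosine-similarity (spherical k-means) refinement loop are illustrative choices, not taken from the paper.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def build_seeds(docs, k, labeled_docs, labeled_features, rng):
    """Build one initial centroid per cluster (hypothetical seeding scheme).

    labeled_docs:     {cluster_id: [document indices]}
    labeled_features: {cluster_id: [feature indices]}
    Clusters without any supervision start from a randomly chosen document.
    """
    dim = len(docs[0])
    seeds = []
    for c in range(k):
        centroid = [0.0] * dim
        # Document supervision: add the normalized labeled documents.
        for i in labeled_docs.get(c, []):
            for j, v in enumerate(normalize(docs[i])):
                centroid[j] += v
        # Feature supervision (assumption): unit weight on each labeled feature.
        for j in labeled_features.get(c, []):
            centroid[j] += 1.0
        # No supervision for this cluster: fall back to random seeding.
        if not any(centroid):
            centroid = list(docs[rng.randrange(len(docs))])
        seeds.append(normalize(centroid))
    return seeds


def seeded_kmeans(docs, k, labeled_docs=None, labeled_features=None,
                  iters=20, seed=0):
    """Cluster documents by cosine similarity, starting from supervised seeds."""
    rng = random.Random(seed)
    unit_docs = [normalize(d) for d in docs]
    centroids = build_seeds(docs, k, labeled_docs or {},
                            labeled_features or {}, rng)
    assign = []
    for _ in range(iters):
        # Assign each document to its most similar centroid.
        new_assign = [
            max(range(k),
                key=lambda c: sum(a * b for a, b in zip(d, centroids[c])))
            for d in unit_docs
        ]
        if new_assign == assign:
            break  # converged
        assign = new_assign
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [unit_docs[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return assign


# Toy corpus: features 0-1 are "sports" terms, features 2-3 are "finance" terms.
docs = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 3], [0, 1, 2, 2], [1, 3, 0, 0]]
# Cluster 0 is seeded by a labeled document, cluster 1 by labeled features.
labels = seeded_kmeans(docs, k=2, labeled_docs={0: [0]},
                       labeled_features={1: [2, 3]})
print(labels)
```

In this toy run, one cluster is seeded only by a labeled document and the other only by labeled features, which mirrors the mixed supervision the abstract describes. Passing an empty dictionary for a cluster leaves it to random seeding, corresponding to the case where only a fraction of the clusters are seeded.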