Personalized document clustering with dual supervision

  • Authors:
  • Yeming Hu;Evangelos E. Milios;James Blustein;Shali Liu

  • Affiliations:
  • Dalhousie University, Halifax, NS, Canada;Dalhousie University, Halifax, NS, Canada;Dalhousie University, Halifax, NS, Canada;Dalhousie University, Halifax, NS, Canada

  • Venue:
  • Proceedings of the 2012 ACM symposium on Document engineering
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

The potential for semi-supervised techniques to produce personalized clusters has not been explored. This is due to the fact that semi-supervised clustering algorithms used to be evaluated using oracles based on underlying class labels. Although using oracles allows clustering algorithms to be evaluated quickly and without labor intensive labeling, it has the key disadvantage that oracles always give the same answer for an assignment of a document or a feature. However, different human users might give different assignments of the same document and/or feature because of different but equally valid points of view. In this paper, we conduct a user study in which we ask participants (users) to group the same document collection into clusters according to their own understanding, which are then used to evaluate semi-supervised clustering algorithms for user personalization. Through our user study, we observe that different users have their own personalized organizations of the same collection and a user's organization changes over time. Therefore, we propose that document clustering algorithms should be able to incorporate user input and produce personalized clusters based on the user input. We also confirm that semi-supervised algorithms with noisy user input can still produce better organizations matching user's expectation (personalization) than traditional unsupervised ones. Finally, we demonstrate that labeling keywords for clusters at the same time as labeling documents can improve clustering performance further compared to labeling only documents with respect to user personalization.