Restrictive clustering and metaclustering for self-organizing document collections

  • Authors:
  • Stefan Siersdorfer;Sergej Sizov

  • Affiliations:
  • Max Planck Institut fuer Informatik, Saarbruecken, Germany;Max Planck Institut fuer Informatik, Saarbruecken, Germany

  • Venue:
  • Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2004

Abstract

This paper addresses the problem of automatically structuring heterogeneous document collections by using clustering methods. In contrast to traditional clustering, we study restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate clusters with low confidence. These techniques yield higher cluster purity and better overall accuracy, and they make unsupervised self-organization more robust. Our comprehensive experimental studies on three different real-world data collections demonstrate these benefits. The proposed methods seem particularly suitable for automatically substructuring personal email folders or personal Web directories that are populated by focused crawlers, and they can be combined with supervised classification techniques.
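
The idea of leaving documents out rather than forcing a low-confidence assignment can be illustrated with a minimal sketch. It is not the authors' algorithm: the confidence measure (the cosine-similarity margin between the best and second-best centroid), the `min_margin` threshold, and the function names `restrictive_assign` and `unanimous_vote` are all assumptions made for illustration. The voting step further assumes that cluster labels are already aligned across the base clusterings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def restrictive_assign(docs, centroids, min_margin=0.2):
    """Assign each document to its most similar centroid, or None to abstain.

    Hypothetical confidence rule: abstain when the gap between the best and
    second-best similarity is below min_margin.
    """
    labels = []
    for doc in docs:
        sims = sorted(
            ((cosine(doc, c), i) for i, c in enumerate(centroids)),
            reverse=True,
        )
        best_sim, best_idx = sims[0]
        runner_up = sims[1][0] if len(sims) > 1 else 0.0
        labels.append(best_idx if best_sim - runner_up >= min_margin else None)
    return labels


def unanimous_vote(label_lists):
    """Ensemble-style meta step: keep a label only if all base results agree.

    Assumes labels are already aligned across the base clusterings.
    """
    combined = []
    for votes in zip(*label_lists):
        first = votes[0]
        agreed = first is not None and all(v == first for v in votes)
        combined.append(first if agreed else None)
    return combined


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
labels = restrictive_assign(docs, centroids)
print(labels)  # the third document is equidistant from both centroids and is left out
```

Raising `min_margin` leaves more documents unassigned in exchange for more confident clusters, which is the trade-off the abstract describes.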