Exploiting Forum Thread Structures to Improve Thread Clustering

  • Authors:
  • Kumaresh Pattabiraman;Parikshit Sondhi;ChengXiang Zhai

  • Affiliations:
  • University of Illinois, Urbana-Champaign, Dept. of Computer Science;University of Illinois, Urbana-Champaign, Dept. of Computer Science;University of Illinois, Urbana-Champaign, Dept. of Computer Science

  • Venue:
  • Proceedings of the 2013 Conference on the Theory of Information Retrieval
  • Year:
  • 2013

Abstract

Automated clustering of threads within and across web forums will greatly benefit both users and forum administrators in efficiently seeking, managing, and integrating the huge volume of content being generated. While clustering has been studied for other types of data, little work has been done on clustering forum threads; the informal nature and special structure of forum data raise the question of how to cluster such threads effectively. In this paper, we apply three state-of-the-art clustering methods (hierarchical agglomerative clustering, k-means, and probabilistic latent semantic analysis) to forum threads and study how to leverage thread structure to improve clustering accuracy. We propose three different methods for assigning weights to the posts in a forum thread to achieve a more accurate representation of the thread. We evaluate all the methods on data collected from three different Linux forums, for both within-forum and across-forum clustering. Our results show that the state-of-the-art methods perform reasonably well on this task, but that performance can be further improved by exploiting thread structure. In particular, a parabolic weighting method that assigns higher weights to both the beginning posts and the end posts of a thread is shown to consistently outperform a standard clustering method.
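
The abstract does not give the formula behind the parabolic weighting, so the following is only a minimal sketch of the general idea: posts at the start and end of a thread receive higher weights than posts in the middle, and the thread is represented as a weighted sum of its posts' term counts. The function names `parabolic_weights` and `thread_vector`, the `floor` parameter, and the specific quadratic form are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def parabolic_weights(n_posts, floor=0.2):
    """Return one weight per post: highest at both ends, lowest in the middle.

    Assumed form (not from the paper): w(x) = floor + (1 - floor) * (2x - 1)^2,
    where x in [0, 1] is the relative position of the post in the thread.
    The weights are normalized so that they sum to 1.
    """
    if n_posts <= 0:
        return []
    if n_posts == 1:
        return [1.0]
    raw = []
    for i in range(n_posts):
        x = i / (n_posts - 1)  # 0.0 for the first post, 1.0 for the last
        raw.append(floor + (1.0 - floor) * (2.0 * x - 1.0) ** 2)
    total = sum(raw)
    return [w / total for w in raw]


def thread_vector(posts, floor=0.2):
    """Build a weighted bag-of-words vector for a thread.

    Each post is tokenized by whitespace (a simplification) and its term
    counts are scaled by the post's positional weight before being summed.
    """
    weights = parabolic_weights(len(posts), floor)
    vec = Counter()
    for w, post in zip(weights, posts):
        for term, count in Counter(post.lower().split()).items():
            vec[term] += w * count
    return dict(vec)
```

The resulting vectors could then be passed to any of the clustering methods named in the abstract. Under this sketch, a term appearing in the opening or closing post contributes more to the thread representation than the same term in a middle post; the `floor` value controls how strongly middle posts are discounted.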