Near-duplicate detection for web-forums

  • Authors:
  • Klemens Muthmann;Wojciech M. Barczyński;Falk Brauer;Alexander Löser

  • Affiliations:
  • Technische Universität Dresden, Dresden, Germany;SAP AG, SAP Research, Dresden, Germany;SAP AG, SAP Research, Dresden, Germany;Technische Universität Berlin, Berlin, Germany

  • Venue:
  • IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
  • Year:
  • 2009

Abstract

Current forum search technologies cannot identify threads with near-duplicate content or group such threads in the search results. As a result, forum users are overloaded with duplicated search results and prefer to create new threads without first trying to find existing ones. In this paper, we therefore identify common reasons that lead to near-duplicates and develop a new near-duplicate detection algorithm for forum threads. We implement the algorithm in a large case study of a real-world forum serving more than one million users. We compare this work with current algorithms, similar to [4, 5], for detecting near-duplicates on machine-generated web pages. Our preliminary results show that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.
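
To make the task concrete, the following is a minimal sketch of a generic near-duplicate baseline: word shingling combined with Jaccard similarity, plus the precision measure used to score predicted groupings. It is not the algorithm developed in the paper, and it is not claimed to be what [4, 5] describe. The function names, the shingle size `k`, the similarity threshold, and the sample threads are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield one shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(threads, threshold=0.5, k=3):
    """Return sorted pairs of thread ids whose shingle sets meet the threshold.

    `threads` maps a thread id to its text. The threshold of 0.5 is an
    illustrative assumption, not a value taken from the paper.
    """
    sets = {tid: shingles(text, k) for tid, text in threads.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


def precision(predicted, relevant):
    """Fraction of predicted pairs that are true near-duplicates."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(relevant)) / len(predicted)


# Hypothetical sample threads, invented for this sketch.
threads = {
    "t1": "How do I reset my password on the portal",
    "t2": "How do I reset my password on the portal please",
    "t3": "Installation fails with error code 42 on Linux",
}
pairs = near_duplicate_pairs(threads)
```

Comparing all pairs is quadratic in the number of threads, so a forum of realistic size would need an indexing or hashing step in front of this comparison; the sketch omits that for brevity.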