Detecting near-duplicate relations in user generated forum content

  • Authors: Klemens Muthmann; Alexander Löser
  • Affiliations: Technical University Dresden, Computer Networks; Technical University Berlin, DIMA Group
  • Venue: OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
  • Year: 2010

Abstract

A web forum is a large database of community knowledge, containing information about the most recent events and developments. Unfortunately, this knowledge is presented in a format that humans understand easily but machines cannot process automatically. However, long-term observation of several forums suggests that there are several distinct types of postings and relations between them. One frequently occurring and very annoying relation between two contributions is the near-duplicate relation. In this paper we propose work on detecting and utilizing contribution relations, concentrating on near-duplication. We propose ideas on how to calculate similarity and build groups of similar threads, and thus make near-duplicates in forums evident. One of the core theses is that information from forum and thread structure can be applied to improve existing near-duplicate detection approaches. In addition, the proposed work shows the qualitative and quantitative results of applying such principles, thereby determining which features are really useful in the near-duplicate detection process. Several sample applications that benefit from forum near-duplicate detection are also proposed.
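
The abstract does not say which similarity measure or grouping strategy the authors use. The following is only a minimal sketch of the general idea of calculating similarity and building groups of similar threads, assuming a generic text-only baseline: word-shingle Jaccard similarity, with threads grouped as connected components above a threshold. The function names, the shingle size `k`, and the `threshold` value are all hypothetical, and the forum and thread structure features that the paper argues for are not modelled here.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_near_duplicates(threads, threshold=0.5, k=3):
    """Group thread ids whose shingle sets reach the similarity threshold.

    threads: dict mapping a thread id to its text (assumed input shape).
    Returns a sorted list of sorted id lists (connected components).
    """
    sets = {tid: shingles(text, k) for tid, text in threads.items()}
    parent = {tid: tid for tid in threads}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of threads; merge the groups of similar pairs.
    for a, b in combinations(threads, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for tid in threads:
        groups.setdefault(find(tid), []).append(tid)
    return sorted(sorted(g) for g in groups.values())


threads = {
    "t1": "how do I reset my router password",
    "t2": "how do I reset my router password please",
    "t3": "best pizza places in Dresden",
}
print(group_near_duplicates(threads))
```

The all-pairs comparison is quadratic in the number of threads, so this sketch is only suitable for small collections. Any structural signal, such as the sub-forum a thread belongs to, would have to be added as a further feature.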