Detecting near-duplicate relations in user generated forum content

Authors:
Klemens Muthmann; Alexander Löser
Affiliations:
Technical University Dresden, Computer Networks; Technical University Berlin, DIMA Group
Venue:
OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Year:
2010

Abstract

A web forum is a large database of community knowledge that holds information on the most recent events and developments. Unfortunately, this knowledge is presented in a format that humans understand easily but machines cannot process automatically. Long-term observation of several forums suggests, however, that postings fall into several distinct types and that distinct relations hold between them. One frequent and particularly annoying relation between two contributions is the near-duplicate relation. In this paper we propose work on detecting and utilizing relations between contributions, concentrating on near-duplication. We propose ideas on how to calculate similarity and build groups of similar threads, and thus make near-duplicates in forums evident. One of the core theses is that information from forum and thread structure can be applied to improve existing near-duplicate detection approaches. In addition, the proposed work reports qualitative and quantitative results of applying these principles, thereby identifying which features are actually useful in the near-duplicate detection process. We also propose several sample applications that benefit from forum near-duplicate detection.
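
The abstract does not spell out how similarity is calculated or how groups are built, so the following is only a minimal sketch of one common way to do it, not the authors' method. It assumes word-shingle Jaccard resemblance as the similarity measure, a hypothetical `title_weight` parameter as a stand-in for "information from thread structure", and a simple union-find pass to group threads whose similarity reaches a hypothetical `threshold`. The thread dictionaries with `title` and `body` keys are likewise an assumed data layout.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard resemblance of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def thread_similarity(t1, t2, title_weight=0.3):
    """Combine body and title resemblance.

    title_weight is a hypothetical knob illustrating how a structural
    feature (the thread title) could be mixed into the score.
    """
    body = jaccard(shingles(t1["body"]), shingles(t2["body"]))
    title = jaccard(shingles(t1["title"], k=1), shingles(t2["title"], k=1))
    return title_weight * title + (1 - title_weight) * body


def group_near_duplicates(threads, threshold=0.5):
    """Group thread indices whose pairwise similarity reaches threshold."""
    parent = list(range(len(threads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Quadratic pairwise comparison: fine for a sketch, not for a full forum.
    for i, j in combinations(range(len(threads)), 2):
        if thread_similarity(threads[i], threads[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(threads)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `group_near_duplicates` on a list of thread dictionaries returns lists of indices, one list per group of near-duplicates. The pairwise loop is chosen for readability; a real forum would need a sub-quadratic candidate-selection step in front of it.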