Near-duplicate detection for web-forums

Authors:
Klemens Muthmann;Wojciech M. Barczyński;Falk Brauer;Alexander Löser
Affiliations:
Technische Universität Dresden, Dresden, Germany;SAP AG, SAP Research, Dresden, Germany;SAP AG, SAP Research, Dresden, Germany;Technische Universität Berlin, Berlin, Germany
Venue:
IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Year:
2009

Citing 10
Cited 5

Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Modern Information Retrieval

Modern Information Retrieval
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
SimFusion: measuring similarity using unified relationship matrix

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Finding similar files in a large file system

WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Toward a PeopleWeb

Computer
Finding high-quality content in social media

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Graph-based concept identification and disambiguation for enterprise search

Proceedings of the 19th international conference on World wide web
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Hypergeometric language models for republished article finding

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Predicting thread discourse structure over technical web forums

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Learning hash codes for efficient content reuse detection

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Current forum search technologies lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicated search results and prefer to create new threads without trying to find existing ones. In this paper we therefore identify common reasons leading to near-duplicates and develop a new near-duplicate detection algorithm for forum threads. The algorithm is implemented using a large case study of a real-world forum serving more than one million users. We compare this work with current algorithms, similar to [4, 5], for detecting near-duplicates on machine generated web pages. Our preliminary results show, that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.