Adaptive near-duplicate detection via similarity learning

  • Authors:
  • Hannaneh Hajishirzi;Wen-tau Yih;Aleksander Kolcz

  • Affiliations:
  • University of Illinois at Urbana-Champaign, Urbana, IL, USA;Microsoft Research, Redmond, WA, USA;Microsoft, Redmond, WA, USA

  • Venue:
  • Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash-values as document signatures through the locality sensitive hashing scheme for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than the commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, which is a desired property lacked by existing methods.