Model words-driven approaches for duplicate detection on the web

  • Authors:
  • Marnix de Bakker; Damir Vandic; Flavius Frasincar; Uzay Kaymak

  • Affiliations:
  • Erasmus University Rotterdam, Rotterdam, The Netherlands; Erasmus University Rotterdam, Rotterdam, The Netherlands; Erasmus University Rotterdam, Rotterdam, The Netherlands; Eindhoven University of Technology, Eindhoven, The Netherlands

  • Venue:
  • Proceedings of the 28th Annual ACM Symposium on Applied Computing
  • Year:
  • 2013

Abstract

Detecting product duplicates is one of the many challenges facing Web shop product aggregators. This paper presents two new methods for product duplicate detection. Both extend a state-of-the-art approach that detects duplicates using the model words found in product titles. The first method applies several distance measures to product attribute keys and values to find duplicate products when no matching product title is found. The second method looks for matching model words in all product attribute values to find duplicate products when no matching product title is found. In experiments on real-world data gathered from two existing Web shops, the second method significantly outperforms the existing state-of-the-art method in terms of F1-measure, while the first method also outperforms it in terms of F1-measure, but not significantly.
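
The idea behind the second method can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give these details, so several things are assumptions made for illustration: a "model word" is taken to be a token that mixes letters and digits, titles are considered to match when they share at least one model word, and the fallback compares the model words of all attribute values with a Jaccard overlap against a hypothetical `threshold` of 0.5. The names `model_words`, `attribute_model_words`, and `is_duplicate` are likewise hypothetical.

```python
import re

# Assumption: a token may contain letters, digits, and the separators - / .
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9\-/.]*")


def model_words(text):
    """Return the set of tokens that contain both a letter and a digit."""
    words = set()
    for token in TOKEN.findall(text):
        token = token.strip("-/.").lower()
        has_alpha = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_alpha and has_digit:
            words.add(token)
    return words


def attribute_model_words(product):
    """Collect model words from every attribute value, ignoring the keys."""
    words = set()
    for value in product["attributes"].values():
        words |= model_words(value)
    return words


def is_duplicate(a, b, threshold=0.5):
    """Simplified two-step check; `threshold` is a hypothetical parameter."""
    # Step 1 (simplified title matching): shared model word in the titles.
    if model_words(a["title"]) & model_words(b["title"]):
        return True
    # Step 2 (fallback): compare model words found in all attribute values.
    words_a = attribute_model_words(a)
    words_b = attribute_model_words(b)
    union = words_a | words_b
    if not union:
        return False
    return len(words_a & words_b) / len(union) >= threshold


tv_shop1 = {
    "title": "Samsung UN46ES6100 46-Inch LED TV",
    "attributes": {"Model": "UN46ES6100", "Resolution": "1080p"},
}
tv_shop2 = {
    "title": 'Samsung 46" LED HDTV',
    "attributes": {"Model Number": "UN46ES6100", "Res": "1080p"},
}
other_tv = {
    "title": "LG 42LM3400 42-Inch",
    "attributes": {"Model": "42LM3400", "Resolution": "1080p"},
}

print(is_duplicate(tv_shop1, tv_shop2))  # True: titles differ, attribute values agree
print(is_duplicate(tv_shop1, other_tv))  # False: only "1080p" is shared
```

Because the fallback ignores attribute keys, the two shops can name the same attribute differently ("Model" versus "Model Number") and still be matched. The overlap threshold is there so that a common value such as "1080p" does not by itself mark two products as duplicates.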