Model words-driven approaches for duplicate detection on the web

Authors:
Marnix de Bakker;Damir Vandic;Flavius Frasincar;Uzay Kaymak
Affiliations:
Erasmus University Rotterdam, Rotterdam, The Netherlands;Erasmus University Rotterdam, Rotterdam, The Netherlands;Erasmus University Rotterdam, Rotterdam, The Netherlands;Eindhoven University of Technology, Eindhoven, The Netherlands
Venue:
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Year:
2013

Abstract

Detecting duplicate products is one of the many challenges facing Web shop product aggregators. This paper presents two new methods for product duplicate detection. Both extend a state-of-the-art approach that detects duplicates using the model words found in product titles. The first method applies several distance measures to product attribute keys and values in order to find duplicate products when no matching product title is found. The second method looks for matching model words in all product attribute values, again for the case where no matching product title is found. In experiments on real-world data gathered from two existing Web shops, the second method significantly outperforms the existing state-of-the-art method in terms of F1-measure. The first method also achieves a higher F1-measure than the existing method, but the difference is not significant.
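
The idea behind the second method, matching model words across all attribute values, can be illustrated with a minimal sketch. The abstract does not define a model word or a decision rule, so several things here are assumptions made only for illustration: a model word is taken to be a token that mixes letters and digits (such as a model number), two products are flagged when they share at least `min_shared` such tokens, and the function names and sample products are hypothetical. This is not the authors' implementation.

```python
import re

# Assumption: a "model word" is a token containing at least one letter and
# at least one digit, e.g. a model number such as "UN46ES6100".
MODEL_WORD = re.compile(r"(?=.*[a-z])(?=.*[0-9])[a-z0-9./-]+")


def model_words(text):
    """Return the set of model-word-like tokens in a string (lowercased)."""
    tokens = re.split(r"[\s,;()]+", text.lower())
    return {t for t in tokens if MODEL_WORD.fullmatch(t)}


def attribute_model_words(product):
    """Collect model words from every attribute value of a product.

    `product` is a dict of attribute key -> value. Keys are ignored, so
    differently named attributes in two shops can still match on values.
    """
    words = set()
    for value in product.values():
        words |= model_words(str(value))
    return words


def likely_duplicates(a, b, min_shared=1):
    """Flag two products when they share enough model words (hypothetical rule)."""
    shared = attribute_model_words(a) & attribute_model_words(b)
    return len(shared) >= min_shared


# Hypothetical product descriptions from two shops with different key names.
shop_a = {"Model": "UN46ES6100", "Screen Size": "46 in"}
shop_b = {"Mfr Part": "UN46ES6100", "Diagonal": "46 inches"}
shop_c = {"Model": "KDL40EX640", "Size": "40 in"}

print(likely_duplicates(shop_a, shop_b))
print(likely_duplicates(shop_a, shop_c))
```

Because only values are compared, the sketch tolerates the differing attribute keys that separate Web shops tend to use. A real system would need a more careful model-word definition and a tuned threshold.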