A hybrid model words-driven approach for web product duplicate detection

Authors:
Marnix de Bakker;Flavius Frasincar;Damir Vandic
Affiliations:
Erasmus University Rotterdam, Rotterdam, The Netherlands;Erasmus University Rotterdam, Rotterdam, The Netherlands;Erasmus University Rotterdam, Rotterdam, The Netherlands
Venue:
CAiSE'13 Proceedings of the 25th international conference on Advanced Information Systems Engineering
Year:
2013

Abstract

Detecting product duplicates is one of the challenges that Web shop aggregators currently face. In this paper, we address the problem of product duplicate detection on the Web. Our proposed method extends a state-of-the-art solution that uses the model words in product titles to find duplicate products. First, we apply that title-based algorithm to find matching product titles. If no matching title is found, our method continues by computing similarities between the two product descriptions. These similarities are based on the product attribute keys and on the product attribute values. Furthermore, instead of extracting model words only from the title, our method also extracts model words from the product attribute values. Experimental results on real-world data gathered from two existing Web shops show that the proposed method significantly outperforms, in terms of F1-measure, both the existing state-of-the-art title model words method and the well-known TF-IDF method.
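The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the definition of a model word (a token mixing letters and digits), the use of Jaccard similarity, the equal weighting of key and value similarity, the 0.5 threshold, and all function names are assumptions made only for illustration.

```python
import re


def model_words(text):
    """Return tokens that mix letters and digits (assumed model-word definition)."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {
        t.lower()
        for t in tokens
        if any(c.isalpha() for c in t) and any(c.isdigit() for c in t)
    }


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def titles_match(title_a, title_b):
    """Hypothetical stand-in for the title model words method:
    two titles match if they share at least one model word."""
    return bool(model_words(title_a) & model_words(title_b))


def attribute_similarity(attrs_a, attrs_b):
    """Average of key similarity and value model-word similarity (assumed weighting)."""
    keys_a = {k.lower() for k in attrs_a}
    keys_b = {k.lower() for k in attrs_b}
    key_sim = jaccard(keys_a, keys_b)

    # Model words are extracted from all attribute values, not only the title.
    words_a = set().union(*(model_words(v) for v in attrs_a.values()))
    words_b = set().union(*(model_words(v) for v in attrs_b.values()))
    value_sim = jaccard(words_a, words_b)

    return (key_sim + value_sim) / 2


def is_duplicate(product_a, product_b, threshold=0.5):
    """Stage 1: title match. Stage 2: fall back to attribute similarity."""
    if titles_match(product_a["title"], product_b["title"]):
        return True
    sim = attribute_similarity(product_a["attributes"], product_b["attributes"])
    return sim >= threshold


tv_shop_1 = {
    "title": "Samsung UN55ES6500 55-inch LED TV",
    "attributes": {"Screen Size": "55in", "Refresh Rate": "120Hz"},
}
tv_shop_2 = {
    "title": "Samsung 55 inch Smart TV",
    "attributes": {"Screen Size": "55in", "Refresh Rate": "120Hz", "Brand": "Samsung"},
}
other_tv = {
    "title": "LG 42LM6200 42-inch TV",
    "attributes": {"Screen Size": "42in", "Weight": "30lb"},
}

print(is_duplicate(tv_shop_1, tv_shop_2))
print(is_duplicate(tv_shop_1, other_tv))
```

In this toy data the first pair has no shared title model word, so the decision falls through to the attribute stage, where matching keys and value model words push the similarity above the assumed threshold. The second pair shares only one attribute key and no value model words, so it is rejected.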