Matching product titles using web-based enrichment

Authors:
Vishrawas Gopalakrishnan;Suresh Parthasarathy Iyengar;Amit Madaan;Rajeev Rastogi;Srinivasan Sengamedu
Affiliations:
SUNY Buffalo, Buffalo, NY, USA;Yahoo Labs, Bangalore, India;TheFind, Inc, Mountain View, CA, USA;Amazon, Bangalore, India;Komli Labs, Bangalore, India
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012


Abstract

Matching product titles from different data feeds that refer to the same underlying product entity is a key problem in online shopping. This matching problem is challenging because titles across feeds have diverse representations: some are missing important keywords such as the brand, while others contain extraneous keywords related to product specifications. In this paper, we propose a novel unsupervised matching algorithm that leverages web search engines to (1) enrich product titles by adding important missing tokens that occur frequently in search results, and (2) compute importance scores for tokens based on their ability to retrieve other (enriched title) tokens in search results. Our matching scheme calculates the cosine similarity between enriched title pairs, with tokens weighted by their importance scores. We also propose an optimization that exploits the templatized structure of product titles to reduce the number of search queries. In experiments with real-life shopping datasets, our matching algorithm achieved higher F1 scores than IDF-based cosine similarity.
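
The two steps named in the abstract, enrichment and importance-weighted cosine similarity, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `enrich` and `weighted_cosine`, the `min_fraction` threshold, and the whitespace tokenization are assumptions made for illustration. Search results are passed in as plain strings rather than fetched from a search engine, and the importance scores are taken as given rather than computed from search results as the paper describes.

```python
import math
from collections import Counter


def enrich(title_tokens, result_snippets, min_fraction=0.5):
    """Append tokens that appear in at least `min_fraction` of the snippets.

    `result_snippets` stands in for search-engine results (hypothetical input);
    the frequency threshold is an assumed heuristic, not the paper's rule.
    """
    doc_freq = Counter()
    for snippet in result_snippets:
        # Count each token once per snippet (document frequency).
        doc_freq.update(set(snippet.lower().split()))
    needed = min_fraction * len(result_snippets)
    present = set(title_tokens)
    extra = sorted(t for t, n in doc_freq.items() if n >= needed and t not in present)
    return list(title_tokens) + extra


def weighted_cosine(tokens_a, tokens_b, importance):
    """Cosine similarity over token sets, each token weighted by `importance`.

    Tokens missing from `importance` get weight 0.0 (an assumption).
    """
    vec_a = {t: importance.get(t, 0.0) for t in set(tokens_a)}
    vec_b = {t: importance.get(t, 0.0) for t in set(tokens_b)}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    snippets = [
        "apple ipod nano 8gb silver",
        "apple ipod nano review",
        "buy ipod nano",
    ]
    enriched = enrich(["ipod", "nano", "8gb"], snippets)
    print(enriched)

    # Hand-assigned scores, for illustration only.
    scores = {"apple": 1.0, "ipod": 2.0, "nano": 2.0, "8gb": 0.5}
    print(weighted_cosine(enriched, ["apple", "ipod", "nano"], scores))
```

In this toy input the brand token "apple" is absent from the title but occurs in two of the three snippets, so it is added during enrichment; the low weight on "8gb" then keeps the specification token from dominating the similarity score.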