Exploiting user clicks for automatic seed set generation for entity matching

Authors:
Xiao Bai;Flavio P. Junqueira;Srinivasan H. Sengamedu
Affiliations:
Yahoo! Research, Barcelona, Spain;Microsoft Research, Cambridge, United Kingdom;Komli Labs, Bangalore, India
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Abstract

Matching entities from different information sources is an important problem in data analysis and data integration. It is challenging, however, because of the number and diversity of the information sources involved and the significant editorial effort required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that: (i) with 360K pages from 6 major travel websites, we obtain 84K matchings (of 179K pages) that refer to the same entities, with an average precision of 0.826; (ii) the quality of matching obtained from a classifier trained on the resulting seed data is promising: its performance matches that of editorial data at small training-set sizes and improves as the size grows.
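
To illustrate how random walk with restart can reduce sparseness in click data, the following is a minimal sketch on a toy bipartite query-page click graph. It is not the authors' implementation: the restart probability of 0.15, the iteration limits, the node labels, and the click counts are all assumptions chosen for illustration. The sketch shows that a page never clicked for a query can still receive a non-zero score when it is reachable through shared queries and pages.

```python
from collections import defaultdict


def build_transitions(clicks):
    """Turn {(query, page): click_count} into row-normalized transition
    probabilities over a bipartite graph. Nodes are tagged tuples
    ("q", query) and ("p", page); this naming is a hypothetical choice."""
    weights = defaultdict(dict)
    for (query, page), count in clicks.items():
        q, p = ("q", query), ("p", page)
        weights[q][p] = weights[q].get(p, 0) + count
        weights[p][q] = weights[p].get(q, 0) + count
    transitions = {}
    for node, nbrs in weights.items():
        total = sum(nbrs.values())
        transitions[node] = {n: w / total for n, w in nbrs.items()}
    return transitions


def random_walk_with_restart(transitions, start, restart=0.15,
                             max_iters=500, tol=1e-12):
    """Power iteration for the walk's stationary distribution.
    At each step the walker follows a click edge with probability
    (1 - restart) and jumps back to `start` with probability `restart`."""
    scores = {node: 0.0 for node in transitions}
    scores[start] = 1.0
    for _ in range(max_iters):
        nxt = {node: 0.0 for node in transitions}
        for node, mass in scores.items():
            for nbr, prob in transitions[node].items():
                nxt[nbr] += (1.0 - restart) * mass * prob
        nxt[start] += restart
        delta = sum(abs(nxt[n] - scores[n]) for n in nxt)
        scores = nxt
        if delta < tol:
            break
    return scores


# Toy click log (made-up data): two queries share one page, a third is unrelated.
clicks = {
    ("hotel arts barcelona", "siteA/hotel-arts"): 5,
    ("hotel arts barcelona", "siteB/arts"): 3,
    ("arts hotel", "siteB/arts"): 2,
    ("ritz paris", "siteA/ritz"): 4,
}
transitions = build_transitions(clicks)
scores = random_walk_with_restart(transitions, ("q", "arts hotel"))
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 4))
```

In this toy graph, the query "arts hotel" was only ever clicked through to `siteB/arts`, yet the walk also assigns weight to `siteA/hotel-arts` via the shared page, while the unrelated `siteA/ritz` stays at zero. Scores smoothed in this way could then feed a grouping step such as co-clustering; how the paper combines the two is not specified in the abstract.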