TYPiMatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration

  • Authors:
  • Yongtao Ma;Thanh Tran

  • Affiliations:
  • AIFB of KIT, Karlsruhe, Germany;AIFB of KIT, Karlsruhe, Germany

  • Venue:
  • Proceedings of the sixth ACM international conference on Web search and data mining
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Instance matching and blocking, a preprocessing step used for selecting candidate matches, require determining the most representative attributes of instances called keys, based on which similarities between instances are computed. We show that for the problem of learning blocking keys and key values, both generic techniques that do not exploit type information and supervised learning techniques optimized for one single predefined type of instances do not perform well on heterogeneous Web data capturing instances for which the predefined type is too general. That is, they actually belong to some subtypes that are not explicitly specified in the data. We propose an unsupervised approach for learning these subtypes and the subtype-specific blocking keys and key values. Compared to state-of-the-art supervised and unsupervised learning approaches that are optimized for one single type, our approach improves efficiency as well as result quality. In particular, we show that the proposed strategy of learning subtype-specific blocking keys and key values improves both blocking and instance matching results.