TYPiMatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration

Authors:
Yongtao Ma;Thanh Tran
Affiliations:
AIFB of KIT, Karlsruhe, Germany;AIFB of KIT, Karlsruhe, Germany
Venue:
Proceedings of the sixth ACM international conference on Web search and data mining
Year:
2013

Abstract

Instance matching, and blocking (a preprocessing step that selects candidate matches), both require determining the most representative attributes of instances, called keys, on which the similarity between instances is computed. We show that, for the problem of learning blocking keys and key values, neither generic techniques that do not exploit type information nor supervised learning techniques optimized for a single predefined type of instances perform well on heterogeneous Web data in which the predefined type is too general, that is, where instances actually belong to subtypes that are not explicitly specified in the data. We propose an unsupervised approach for learning these subtypes together with the subtype-specific blocking keys and key values. Compared to state-of-the-art supervised and unsupervised learning approaches that are optimized for a single type, our approach improves both efficiency and result quality. In particular, we show that the proposed strategy of learning subtype-specific blocking keys and key values improves both blocking and instance matching results.
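
To make the idea of subtype-specific blocking keys concrete, the following is a minimal Python sketch. It is not the TYPiMatch algorithm: the subtype labels are supplied by hand here, whereas the paper learns them without supervision, and the function name `candidate_pairs`, the mapping `keys_by_subtype`, the whitespace tokenization of key values, and the toy instances are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(instances, keys_by_subtype):
    """Return pairs of instance ids that share a key value token.

    Assumption: each instance already carries a 'subtype' label, and
    keys_by_subtype maps a subtype to the attributes used as its
    blocking keys. Both are hypothetical stand-ins for learned output.
    """
    blocks = defaultdict(set)
    for inst_id, inst in instances.items():
        subtype = inst["subtype"]
        for key in keys_by_subtype.get(subtype, []):
            # Key values are approximated by lowercased whitespace tokens.
            for token in inst.get(key, "").lower().split():
                blocks[(subtype, key, token)].add(inst_id)

    pairs = set()
    for ids in blocks.values():
        # Every pair inside a block becomes a candidate match.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy data (invented for illustration only).
instances = {
    "d1": {"subtype": "film", "title": "The Matrix"},
    "d2": {"subtype": "film", "title": "Matrix Reloaded"},
    "d3": {"subtype": "person", "name": "Keanu Reeves"},
    "d4": {"subtype": "person", "name": "Keanu C. Reeves"},
    "d5": {"subtype": "person", "name": "Matrix Smith"},
}
keys_by_subtype = {"film": ["title"], "person": ["name"]}

pairs = candidate_pairs(instances, keys_by_subtype)
print(sorted(pairs))
```

In this sketch the token "matrix" does not pair the person d5 with the films d1 and d2, because blocks are formed per subtype and per key. A single generic key applied across all instances would produce those extra candidates.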