Instance matching and blocking (a preprocessing step that selects candidate matches) both require determining the most representative attributes of instances, called keys, on which similarities between instances are computed. We show that, for the problem of learning blocking keys and key values, neither generic techniques that ignore type information nor supervised learning techniques optimized for a single predefined instance type perform well on heterogeneous Web data in which the predefined type is too general, that is, where instances actually belong to subtypes that are not explicitly specified in the data. We propose an unsupervised approach for learning these subtypes together with the subtype-specific blocking keys and key values. Compared to state-of-the-art supervised and unsupervised learning approaches that are optimized for a single type, our approach improves both efficiency and result quality. In particular, we show that learning subtype-specific blocking keys and key values improves both blocking and instance matching results.
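To make the role of blocking keys concrete, the following is a minimal sketch of key-based blocking, not the approach proposed here. It assumes that subtype labels and per-subtype keys are already given, whereas the proposed approach learns them. All names (`build_blocks`, `candidate_pairs`, the `city` and `film` subtypes, and the toy instances) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(instances, key_for):
    """Group instance ids by the normalized value of their blocking key.

    instances: dict mapping instance id -> dict of attribute values
    key_for:   function mapping an instance id to the attribute name
               used as its blocking key (hypothetical helper)
    """
    blocks = defaultdict(list)
    for inst_id, attrs in instances.items():
        key = key_for(inst_id)
        value = attrs.get(key)
        # Instances that lack the key attribute end up in no block.
        if isinstance(value, str) and value.strip():
            blocks[(key, value.strip().lower())].append(inst_id)
    return blocks


def candidate_pairs(blocks):
    """Only instances that share a block become candidate matches."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Toy data: every instance has the same overly general type, but the
# instances implicitly fall into two subtypes with different attributes.
instances = {
    "c1": {"type": "Thing", "name": "Berlin"},
    "c2": {"type": "Thing", "name": "berlin "},
    "f1": {"type": "Thing", "title": "Paris, Texas"},
    "f2": {"type": "Thing", "title": "paris, texas"},
}

# One key for the single general type: the film instances have no "name".
single_key_pairs = candidate_pairs(build_blocks(instances, lambda _id: "name"))

# Assumed (not learned) subtype assignment and subtype-specific keys.
subtype_of = {"c1": "city", "c2": "city", "f1": "film", "f2": "film"}
key_of_subtype = {"city": "name", "film": "title"}
subtype_pairs = candidate_pairs(
    build_blocks(instances, lambda i: key_of_subtype[subtype_of[i]])
)

print(single_key_pairs)
print(subtype_pairs)
```

In this toy setting the single key `name` yields no candidate pair for the two film instances, while the subtype-specific keys recover it. This is only meant to illustrate why a key chosen for a too-general type can miss matches.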