Unsupervised learning of link discovery configuration

Authors:
Andriy Nikolov;Mathieu d'Aquin;Enrico Motta
Affiliations:
Knowledge Media Institute, The Open University, Milton Keynes, UK;Knowledge Media Institute, The Open University, Milton Keynes, UK;Knowledge Media Institute, The Open University, Milton Keynes, UK
Venue:
ESWC'12 Proceedings of the 9th international conference on The Semantic Web: research and applications
Year:
2012

Abstract

Discovering links between overlapping datasets on the Web is generally realised through the use of fuzzy similarity measures. Configuring such measures is often a non-trivial task that depends on the domain, the ontological schemas, and the formatting conventions of the data. Existing solutions rely either on the user's knowledge of the data and the domain or on machine learning to discover these parameters from training data. In this paper, we present a novel approach to data linking that relies on the unsupervised discovery of the required similarity parameters. Instead of using labelled data, the method takes into account several desired properties which the distribution of output similarity values should satisfy. The method incorporates these properties into a fitness criterion used in a genetic algorithm to establish similarity parameters that maximise the quality of the resulting linkset according to the considered properties. We show in experiments on benchmarks as well as real-world datasets that such an unsupervised method can reach the same levels of performance as manually engineered methods, and we analyse how the different parameters of the genetic algorithm and the fitness criterion affect the results for different datasets.
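
The general idea in the abstract, a genetic algorithm that searches for similarity parameters using a label-free fitness criterion, can be illustrated with a minimal Python sketch. The abstract does not state the actual fitness criterion, so everything specific here is an assumption made for illustration: the genome layout (per-attribute weights plus a threshold), the `pseudo_fitness` proxy (which rewards linksets where each source record has at most one link and where many records are linked), the toy `SOURCE` and `TARGET` records, and all function names. It should not be read as the authors' method.

```python
import random
from difflib import SequenceMatcher

# Toy records (hypothetical data): each record is a tuple of attribute strings.
SOURCE = [("Alan Turing", "1912"), ("Grace Hopper", "1906"), ("Ada Lovelace", "1815")]
TARGET = [("A. Turing", "1912"), ("Hopper, Grace", "1906"), ("Lovelace, Ada", "1815")]

def similarity(a, b, weights):
    """Weighted average of per-attribute string similarities, in [0, 1]."""
    total = sum(weights)
    if total == 0:
        return 0.0
    score = 0.0
    for w, x, y in zip(weights, a, b):
        score += w * SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return score / total

def links(source, target, genome):
    """All (i, j) pairs whose similarity reaches the genome's threshold."""
    weights, threshold = genome
    return [(i, j) for i, a in enumerate(source) for j, b in enumerate(target)
            if similarity(a, b, weights) >= threshold]

def pseudo_fitness(source, target, genome):
    """Label-free quality proxy (an assumption, not the paper's criterion)."""
    found = links(source, target, genome)
    if not found:
        return 0.0
    linked = {i for i, _ in found}
    # Pseudo-precision drops when one source record is linked to several targets.
    p = len(linked) / len(found)
    # Pseudo-recall is the share of records that received at least one link.
    r = min(1.0, len(linked) / min(len(source), len(target)))
    return 2 * p * r / (p + r)

def random_genome(n_attrs, rng):
    return ([rng.random() for _ in range(n_attrs)], rng.uniform(0.3, 1.0))

def clamp(x):
    return min(1.0, max(0.0, x))

def mutate(genome, rng, rate=0.3):
    """Perturb each weight and the threshold with probability `rate`."""
    weights, threshold = genome
    weights = [clamp(w + rng.gauss(0, 0.2)) if rng.random() < rate else w
               for w in weights]
    if rng.random() < rate:
        threshold = clamp(threshold + rng.gauss(0, 0.1))
    return (weights, threshold)

def crossover(a, b, rng):
    """Uniform crossover: each gene is taken from one of the two parents."""
    weights = [rng.choice(pair) for pair in zip(a[0], b[0])]
    return (weights, rng.choice((a[1], b[1])))

def evolve(source, target, generations=20, pop_size=20, seed=0):
    """Return the fittest genome found; the top quarter survives each round."""
    rng = random.Random(seed)
    n_attrs = len(source[0])
    population = [random_genome(n_attrs, rng) for _ in range(pop_size)]

    def score(g):
        return pseudo_fitness(source, target, g)

    for _ in range(generations):
        elite = sorted(population, key=score, reverse=True)[: max(1, pop_size // 4)]
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=score)
```

Calling `evolve(SOURCE, TARGET)` returns a `(weights, threshold)` pair, and `links(SOURCE, TARGET, genome)` then yields the corresponding linkset. Because no labelled pairs are consulted, the result is only as good as the assumed proxy; a different fitness criterion would slot into `pseudo_fitness` without changing the search loop.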