Automatically generating data linkages using a domain-independent candidate selection approach

Authors:
Dezhao Song;Jeff Heflin
Affiliations:
Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA;Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA
Venue:
ISWC'11 Proceedings of the 10th international conference on The semantic web - Volume Part I
Year:
2011

Abstract

One challenge for Linked Data is establishing high-quality owl:sameAs links between instances (e.g., people, geographical locations, and publications) in different data sources in a scalable way. Traditional approaches to this entity coreference problem do not scale because they exhaustively compare every pair of instances. In this paper, we propose a candidate selection algorithm that prunes the search space for entity coreference. We select candidate instance pairs by computing a character-level similarity on discriminating literal values, which are chosen using domain-independent unsupervised learning. We index the instances on the literal values of the chosen predicates so that similar instances can be looked up efficiently. We evaluate our approach on two RDF datasets and three structured datasets. We show that the traditional metrics do not always accurately reflect the relative benefits of candidate selection, and we propose additional metrics. We show that our algorithm frequently outperforms alternatives and can process 1 million instances in under one hour on a single Sun Workstation. Furthermore, on the RDF datasets, we show that the entire entity coreference process scales well when our technique is applied. Surprisingly, this high-recall, low-precision filtering mechanism frequently leads to higher F-scores in the overall system.
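
The pipeline described in the abstract (choose discriminating predicates, index instances on their literal values, then keep only pairs with similar literals) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions: the discriminability measure (distinct values divided by number of triples), the use of character bigrams with Jaccard overlap as the character-level similarity, the threshold values, and the toy triples are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def discriminability(triples, predicate):
    """Share of distinct literal values among a predicate's triples.

    Assumed measure for illustration; the abstract does not define it.
    """
    values = [o for (_, p, o) in triples if p == predicate]
    return len(set(values)) / len(values) if values else 0.0


def bigrams(text):
    """Set of character bigrams of a lowercased literal."""
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def select_candidates(triples, predicates, threshold=0.5):
    """Return instance pairs whose chosen literals have Jaccard overlap >= threshold."""
    grams = defaultdict(set)  # instance -> bigrams of its chosen literals
    for s, p, o in triples:
        if p in predicates:
            grams[s] |= bigrams(o)

    index = defaultdict(set)  # bigram -> instances containing it (inverted index)
    for inst, gs in grams.items():
        for g in gs:
            index[g].add(inst)

    candidates = set()
    for inst, gs in grams.items():
        # Only instances sharing at least one bigram are ever compared.
        neighbours = set()
        for g in gs:
            neighbours |= index[g]
        neighbours.discard(inst)
        for other in neighbours:
            jaccard = len(gs & grams[other]) / len(gs | grams[other])
            if jaccard >= threshold:
                candidates.add(tuple(sorted((inst, other))))
    return candidates


# Hypothetical toy data: (instance, predicate, literal value).
triples = [
    ("a1", "name", "Jeff Heflin"),
    ("b1", "name", "Jeff Hefflin"),
    ("b2", "name", "Dezhao Song"),
    ("a1", "type", "Person"),
    ("b1", "type", "Person"),
    ("b2", "type", "Person"),
]

# Keep predicates whose values are mostly distinct (0.5 is an assumed cutoff).
chosen = {p for p in ("name", "type") if discriminability(triples, p) > 0.5}
pairs = select_candidates(triples, chosen)
print(chosen, pairs)
```

In this toy run only the "name" predicate is kept, and the only candidate pair is the one with near-identical names, so the later and more expensive coreference step would compare one pair instead of all three. The inverted index is what avoids the exhaustive pairwise comparison: an instance is compared only with instances that share at least one bigram with it.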