Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora

Authors:
Aidan Hogan;Antoine Zimmermann;Jürgen Umbrich;Axel Polleres;Stefan Decker
Affiliations:
Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland;INSA-Lyon, LIRIS, UMR5205, Villeurbanne F-69621, France;Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland;Siemens AG Österreich, Siemensstrasse 90, 1210 Vienna, Austria;Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
Venue:
Web Semantics: Science, Services and Agents on the World Wide Web
Year:
2012

Abstract

In this paper, we discuss scalable and distributed methods for entity consolidation (a.k.a. smushing, entity resolution, object consolidation, etc.) over large-scale, static Linked Data corpora, with the aim of locating and processing names that signify the same entity. We investigate (i) a baseline approach, which uses explicit owl:sameAs relations to perform consolidation; (ii) extended entity consolidation, which additionally uses a subset of OWL 2 RL/RDF rules to derive novel owl:sameAs relations through the semantics of inverse-functional properties, functional properties and (max-)cardinality restrictions with value one; (iii) the derivation of weighted concurrence measures between entities in the corpus, based on shared inlinks/outlinks and attribute values, using statistical analyses; and (iv) the disambiguation of (initially) consolidated entities based on inconsistency detection using OWL 2 RL/RDF rules. Our methods are based upon distributed sorts and scans of the corpus, and deliberately avoid the requirement for indexing all data. Throughout, we offer evaluation over a diverse Linked Data corpus consisting of 1.118 billion quadruples derived from a domain-agnostic, open crawl of 3.985 million RDF/XML Web documents, demonstrating the feasibility of our methods at that scale and giving insights into the quality of the results for real-world data.
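
As a rough illustration of approaches (i) and (ii), the sketch below groups identifiers into equivalence classes and rewrites every triple to a canonical identifier. It is not the method of the paper: the abstract describes distributed sorts and scans over the corpus, whereas this toy version swaps in an in-memory union-find for brevity. The names `UnionFind` and `consolidate`, the `ex:` identifiers, the choice of the lexicographically lowest identifier as canonical, and the single-pass handling of inverse-functional properties are all assumptions made for the example.

```python
OWL_SAME_AS = "owl:sameAs"


class UnionFind:
    """Disjoint-set structure over identifier strings (hypothetical helper)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Assumption: the lexicographically lowest identifier is canonical.
        if rb < ra:
            ra, rb = rb, ra
        self.parent[rb] = ra


def consolidate(triples, inverse_functional=frozenset()):
    """Rewrite triples so that equivalent identifiers share one canonical name.

    Equivalences come from explicit owl:sameAs triples (baseline) and from
    subjects sharing a value for an inverse-functional property (extended).
    Simplification: values of inverse-functional properties are compared
    as given, in a single pass, without first being consolidated themselves.
    """
    uf = UnionFind()
    first_subject_for = {}  # (property, value) -> first subject seen
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            uf.union(s, o)
        elif p in inverse_functional:
            key = (p, o)
            if key in first_subject_for:
                uf.union(first_subject_for[key], s)
            else:
                first_subject_for[key] = s

    def canonical(term):
        # Only terms that took part in an equivalence are rewritten.
        return uf.find(term) if term in uf.parent else term

    rewritten = [
        (canonical(s), p, canonical(o))
        for s, p, o in triples
        if p != OWL_SAME_AS
    ]
    # Drop duplicates produced by the rewriting, keeping first-seen order.
    return list(dict.fromkeys(rewritten))


example = [
    ("ex:a", "owl:sameAs", "ex:b"),
    ("ex:b", "foaf:name", '"Alice"'),
    ("ex:c", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:a", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:d", "foaf:knows", "ex:c"),
]
for triple in consolidate(example, inverse_functional={"foaf:mbox"}):
    print(triple)
```

Here `ex:b` is merged with `ex:a` through the explicit owl:sameAs triple, and `ex:c` is merged with `ex:a` because both share a value for `foaf:mbox`, which FOAF declares inverse-functional. At the scale reported in the abstract, an in-memory structure like this would likely not be practical, which is presumably why the authors rely on sorts and scans instead.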