The missing links: discovering hidden same-as links among a billion of triples

Authors:
George Papadakis;Gianluca Demartini;Peter Fankhauser;Philipp Kärger
Affiliations:
L3S Research Center, Hannover, Germany;L3S Research Center, Hannover, Germany;Fraunhofer IPSI, Darmstadt, Germany;L3S Research Center, Hannover, Germany
Venue:
Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
Year:
2010

Abstract

The Semantic Web is steadily gaining momentum as more Web sites and content providers adopt its principles. At the core of these principles lies the Linked Data movement, which demands that data on the Web be annotated and linked across different sources instead of being isolated in data silos. To realize this vision of a web of semantics, existing resource identifiers should be reused and shared between Web sites. This is often not the case in the current Semantic Web, where multiple identifiers are frequently introduced redundantly for the same resource. In this paper we introduce a novel approach that automatically detects redundant identifiers solely by matching the URIs of information resources. The approach, based on a common pattern among Semantic Web URIs, provides a simple and practical method for duplicate detection. We apply this method to a large snapshot of the current Semantic Web comprising 1.15 billion statements and estimate the number of hidden duplicates in it. The outcomes of our experiments confirm both the effectiveness and the efficiency of our method, and suggest that URI matching can serve as a scalable filter for discovering implicit same-as links.
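
The abstract does not spell out the URI pattern it relies on, so the following is only a minimal sketch of what URI-only matching could look like, not the authors' method. It assumes, purely for illustration, that a URI splits into a source-specific host, a resource-specific middle part (called `infix` here), and an optional representation suffix, and that two URIs from different hosts sharing the same middle part are candidate same-as pairs. The suffix list, the case folding, the function names, and the example URIs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical suffixes assumed to name a representation, not the resource.
SUFFIXES = ("/rdf", "/html", ".rdf", ".html", ".xml")


def infix(uri: str) -> str:
    """Return the part of the URI assumed to identify the resource locally."""
    parts = urlsplit(uri)
    local = parts.path
    if parts.fragment:
        local += "#" + parts.fragment
    local = local.lower()  # assumption: case differences are not significant
    for suffix in SUFFIXES:
        if local.endswith(suffix):
            local = local[: -len(suffix)]
            break
    return local.strip("/")


def candidate_duplicates(uris):
    """Group URIs that share a non-empty infix across at least two hosts."""
    groups = defaultdict(set)
    for uri in uris:
        key = infix(uri)
        if key:
            groups[key].add(uri)
    return {
        key: sorted(members)
        for key, members in groups.items()
        if len({urlsplit(u).netloc for u in members}) > 1
    }


if __name__ == "__main__":
    sample = [
        "http://example.org/people/Jane_Doe",
        "http://data.example.com/people/jane_doe/rdf",
        "http://example.org/people/John_Smith.html",
    ]
    for key, members in candidate_duplicates(sample).items():
        print(key, members)
```

Because the grouping key is a plain string, a single pass with a hash map is enough, which is one way a URI-based filter could stay cheap at the scale of a billion statements. The groups are only candidates: a shared middle part does not prove that two URIs denote the same resource, so a stricter check would still be needed before asserting a same-as link.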