The Semantic Web is steadily gaining momentum as more Web sites and content providers adopt its principles. At the core of these principles lies the Linked Data movement, which demands that data on the Web be annotated and linked across different sources instead of being isolated in data silos. To realize this vision of a web of semantics, existing resource identifiers should be reused and shared between Web sites. This is not always the case in the current Semantic Web, where multiple identifiers are, more often than not, redundantly introduced for the same resource. In this paper we introduce a novel approach that automatically detects redundant identifiers solely by matching the URIs of information resources. The approach, based on a common pattern among Semantic Web URIs, provides a simple and practical method for duplicate detection. We apply this method to a large snapshot of the current Semantic Web comprising 1.15 billion statements and estimate the number of hidden duplicates in it. The outcomes of our experiments confirm both the effectiveness and the efficiency of our method, and suggest that URI matching can be used as a scalable filter for discovering implicit same-as links.
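
To make the idea of URI matching concrete, the following is a minimal sketch and not the paper's actual algorithm. The abstract does not spell out the URI pattern it relies on, so the sketch assumes one for illustration: the identifying part of a URI is its fragment, or else its last path segment, compared case-insensitively. The function names `uri_key` and `candidate_duplicates` and the example.org / example.com URIs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def uri_key(uri: str) -> str:
    """Return a hypothetical matching key for a URI.

    Assumption (not from the paper): the key is the fragment if one is
    present, otherwise the last non-empty path segment, lowercased.
    """
    parts = urlsplit(uri.strip())
    if parts.fragment:
        local = parts.fragment
    else:
        segments = [s for s in parts.path.split("/") if s]
        # Fall back to the host name when the path is empty.
        local = segments[-1] if segments else parts.netloc
    return local.lower()


def candidate_duplicates(uris):
    """Group URIs by key and keep only groups with more than one distinct URI."""
    groups = defaultdict(set)
    for uri in uris:
        groups[uri_key(uri)].add(uri)
    return {key: sorted(members)
            for key, members in groups.items()
            if len(members) > 1}


if __name__ == "__main__":
    sample = [
        "http://example.org/resource/Berlin",
        "http://example.com/place#Berlin",
        "http://example.org/city/Paris",
    ]
    for key, members in candidate_duplicates(sample).items():
        print(key, members)
```

Because grouping needs only a single pass and a hash map, a filter of this kind scales with the number of URIs rather than the number of URI pairs. The groups it produces are only candidates for same-as links; a key this coarse will also collide on unrelated resources that happen to share a local name.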