Efficient semantic-aware detection of near duplicate resources

Authors:
Ekaterini Ioannou;Odysseas Papapetrou;Dimitrios Skoutas;Wolfgang Nejdl
Affiliations:
L3S Research Center/Leibniz Universität Hannover;L3S Research Center/Leibniz Universität Hannover;L3S Research Center/Leibniz Universität Hannover;L3S Research Center/Leibniz Universität Hannover
Venue:
ESWC'10 Proceedings of the 7th international conference on The Semantic Web: research and Applications - Volume Part II
Year:
2010

Citing (13)

Min-wise independent permutations (extended abstract)
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing

PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric
Journal of the ACM (JACM)

Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing

Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases

Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry

Iterative record linkage for cleaning and integration
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

LSH forest: self-tuning indexes for similarity search
WWW '05 Proceedings of the 14th international conference on World Wide Web

Reference reconciliation in complex information spaces
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Semantic analytics on social networks: experiences in addressing the problem of conflict of interest detection
Proceedings of the 15th international conference on World Wide Web

Domain-independent data cleaning via analysis of entity-relationship graph
ACM Transactions on Database Systems (TODS)

Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web

Probabilistic Entity Linkage for Heterogeneous Information Spaces
CAiSE '08 Proceedings of the 20th international conference on Advanced Information Systems Engineering

Leveraging personal metadata for Desktop search: The Beagle++ system
Web Semantics: Science, Services and Agents on the World Wide Web

Cited by (3)

Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora
Web Semantics: Science, Services and Agents on the World Wide Web

Towards fuzzy query-relaxation for RDF
ESWC'12 Proceedings of the 9th international conference on The Semantic Web: research and applications

Domain-Independent Entity Coreference for Linking Ontology Instances
Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution

Abstract

Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.
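
The abstract does not name the indexing scheme or give the probabilistic analysis, so the following is only a generic illustration of how similarity-search indexing over RDF descriptions might look, not the authors' method. It swaps in plain MinHash with banding (a standard locality-sensitive hashing technique) and assumes each resource is described by a set of (predicate, object) pairs with the subject URI left out. The function names and the `bands` and `rows` parameters are hypothetical. The last function gives the standard banding estimate of how likely two resources with a given Jaccard similarity are to share a bucket, which is the kind of knob an application could tune to meet a quality requirement.

```python
import hashlib
from collections import defaultdict


def minhash_signature(features, num_hashes):
    """Return one min-hash value per seeded hash function.

    `features` is a non-empty set of (predicate, object) string pairs
    (an assumed representation of a resource's RDF description).
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}|{p}|{o}".encode("utf-8"),
                                digest_size=8).digest(),
                "big",
            )
            for (p, o) in features
        ))
    return signature


def candidate_pairs(resources, bands, rows):
    """Return pairs of resource names that share at least one band bucket.

    `resources` maps a resource name to its feature set.
    """
    buckets = defaultdict(set)
    for name, features in resources.items():
        sig = minhash_signature(features, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)

    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


def detection_probability(similarity, bands, rows):
    """Probability that two sets with the given Jaccard similarity
    collide in at least one band: 1 - (1 - s^rows)^bands."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


if __name__ == "__main__":
    # Toy descriptions; the predicates and values are made up for illustration.
    article = {("dc:title", "Storm hits coast"), ("dc:subject", "weather"),
               ("ex:mentions", "Hamburg")}
    resources = {
        "a": article,
        "b": set(article),  # exact duplicate of "a"
        "c": {("dc:title", "Election results"), ("dc:subject", "politics")},
    }
    print(candidate_pairs(resources, bands=4, rows=2))
    print(detection_probability(0.8, bands=20, rows=5))
```

Raising `bands` makes the index more likely to surface a true near duplicate, while raising `rows` makes it less likely to surface dissimilar pairs. Candidate pairs would still need to be verified against the full descriptions.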