Efficient similarity joins for near duplicate detection

Authors:
Chuan Xiao;Wei Wang;Xuemin Lin;Jeffrey Xu Yu
Affiliations:
University of New South Wales, Kensington, NSW, Australia;University of New South Wales, Kensington, NSW, Australia;University of New South Wales, Kensington, NSW, Australia;Chinese University of Hong Kong, Hong Kong, China
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Citing 25
Cited 77

Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Finding replicated Web collections

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Modern Information Retrieval

Modern Information Retrieval
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Data Mining and Knowledge Discovery
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
On Approximate String Matching

Proceedings of the 1983 International FCT-Conference on Fundamentals of Computation Theory
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Efficient similarity search and classification via rank aggregation

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Online duplicate document detection: signature reliability in a dynamic retrieval environment

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Evaluating similarity measures: a large-scale study in the orkut social network

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Discovering large dense subgraphs in massive graphs

VLDB '05 Proceedings of the 31st international conference on Very large data bases
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Adaptive Name Matching in Information Integration

IEEE Intelligent Systems
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web

Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Efficient overlap and content reuse detection in blogs and online news articles

Proceedings of the 18th international conference on World wide web
Efficient interactive fuzzy keyword search

Proceedings of the 18th international conference on World wide web
Fast error-tolerant search on very large texts

Proceedings of the 2009 ACM symposium on Applied Computing
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Frequent Itemset Mining for Clustering Near Duplicate Web Documents

ICCS '09 Proceedings of the 17th International Conference on Conceptual Structures: Conceptual Structures: Leveraging Semantic Technologies
Efficient Set Similarity Joins Using Min-prefixes

ADBIS '09 Proceedings of the 13th East European Conference on Advances in Databases and Information Systems
Fast Matching for All Pairs Similarity Search

WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Incremental similarity joins with edit distance constraints

Proceedings of the 18th ACM conference on Information and knowledge management
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Incremental all pairs similarity search for varying similarity thresholds

Proceedings of the 3rd Workshop on Social Network Mining and Analysis
HARRA: fast iterative hashed record linkage for large-scale data collections

Proceedings of the 13th International Conference on Extending Database Technology
A pattern tree-based approach to learning URL normalization rules

Proceedings of the 19th international conference on World wide web
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
MapDupReducer: detecting near duplicates over massive datasets

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Similarity joins as stronger metric operations

SIGSPATIAL Special
Generalizing prefix filtering to improve set similarity joins

Information Systems
CasJoin: a cascade chain for text similarity joins

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
An indexing scheme for fast and accurate chemical fingerprint database searching

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Duplicate identification in deep web data integration

WAIM'10 Proceedings of the 11th international conference on Web-age information management
Exact and efficient proximity graph computation

ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
An efficient similarity join algorithm with cosine similarity predicate

DEXA'10 Proceedings of the 21st international conference on Database and expert systems applications: Part II
Scaling up top-K cosine similarity search

Data & Knowledge Engineering
Evaluation of entity resolution approaches on real-world match problems

Proceedings of the VLDB Endowment
Set similarity join on probabilistic data

Proceedings of the VLDB Endowment
Trie-join: efficient trie-based string similarity joins with edit-distance constraints

Proceedings of the VLDB Endowment
Towards active detection of identity clone attacks on online social networks

Proceedings of the first ACM conference on Data and application security and privacy
Context-sensitive document ranking

Journal of Computer Science and Technology
Fixing the threshold for effective detection of near duplicate web documents in web crawling

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Approximate entity extraction in temporal databases

World Wide Web
Efficient k-nearest neighbor graph construction for generic similarity measures

Proceedings of the 20th international conference on World wide web
Approximate String Processing

Foundations and Trends in Databases
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Batch text similarity search with MapReduce

APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
A supervised machine learning approach for duplicate detection over gazetteer records

GeoS'11 Proceedings of the 4th international conference on GeoSpatial semantics
Efficient fuzzy full-text type-ahead search

The VLDB Journal — The International Journal on Very Large Data Bases
Learning top-k transformation rules

DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I
Efficient duplicate detection on cloud using a new signature scheme

WAIM'11 Proceedings of the 12th international conference on Web-age information management
Efficient top-K approximate searches against a relation with multiple attributes

World Wide Web
Automatically generating data linkages using a domain-independent candidate selection approach

ISWC'11 Proceedings of the 10th international conference on The semantic web - Volume Part I
Context-based entity description rule for entity resolution

Proceedings of the 20th ACM international conference on Information and knowledge management
Pass-join: a partition-based method for similarity joins

Proceedings of the VLDB Endowment
SpSJoin: parallel spatial similarity joins

Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Bayesian locality sensitive hashing for fast similarity search

Proceedings of the VLDB Endowment
Efficient processing of probabilistic set-containment queries on uncertain set-valued data

Information Sciences: an International Journal
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
Can we beat the prefix filtering?: an adaptive framework for similarity join and search

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Measuring semantic similarity between words by removing noise and redundancy in web snippets

Concurrency and Computation: Practice & Experience
CRSI: a compact randomized similarity index for set-valued features

Proceedings of the 15th International Conference on Extending Database Technology
Learning hash codes for efficient content reuse detection

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Efficient range queries over uncertain strings

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
An optimized in-network aggregation scheme for data collection in periodic sensor networks

ADHOC-NOW'12 Proceedings of the 11th international conference on Ad-hoc, Mobile, and Wireless Networks
Recommendations using linked data

Proceedings of the 5th Ph.D. workshop on Information and knowledge
Link discovery with guaranteed reduction ratio in affine spaces with minkowski measures

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I
DEQA: deep web extraction for question answering

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II
Of cubes, DAGs and hierarchical correlations: a novel conceptual model for analyzing social media data

ER'12 Proceedings of the 31st international conference on Conceptual Modeling
Optimizing parallel algorithms for all pairs similarity search

Proceedings of the sixth ACM international conference on Web search and data mining
Domain-Independent Entity Coreference for Linking Ontology Instances

Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
A MapReduce-based filtering algorithm for vector similarity join

Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Trie-based similarity search and join

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Cache-conscious performance optimization for similarity search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Don't match twice: redundancy-free similarity computation with MapReduce

Proceedings of the Second Workshop on Data Analytics in the Cloud
A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems (TODS)
Scalable all-pairs similarity search in metric spaces

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
Efficient filtering and ranking schemes for finding inclusion dependencies on the web

Proceedings of the 22nd ACM international conference on Information & knowledge management
Entity resolution on uncertain relations

WAIM'13 Proceedings of the 14th international conference on Web-Age Information Management
Extending string similarity join to tolerant fuzzy token matching

ACM Transactions on Database Systems (TODS)
Scalable column concept determination for web tables using large knowledge bases

Proceedings of the VLDB Endowment
Clustering with Proximity Graphs: Exact and Efficient Algorithms

International Journal of Knowledge-Based Organizations
Using Non-Zero Dimensions for the Cosine and Tanimoto Similarity Search Among Real Valued Vectors

Fundamenta Informaticae - To Andrzej Skowron on His 70th Birthday
EsPRESSO: Efficient privacy-preserving evaluation of sample set similarity

Journal of Computer Security

Quantified Score

Hi-index	0.00

Abstract

With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near-duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records whose similarity is above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit the ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes, and hence improve efficiency. Experimental results show that our proposed algorithms achieve a speed-up of up to 2.6x to 5x over previous algorithms on several real datasets and provide alternative solutions to the near-duplicate Web page detection problem.
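
To make the prefix filtering principle mentioned in the abstract concrete, the following is a minimal Python sketch of a prefix-filtering self-join. It is an illustration only, not the paper's new filtering techniques. The abstract does not name a similarity function, so Jaccard similarity over token sets is assumed here. The names `jaccard` and `prefix_filter_join`, the rarest-token-first ordering, and the floating-point guard `EPS` are all hypothetical choices made for this example. The idea is that, once every record is sorted by one global token order, two records can only reach the threshold if their short prefixes share at least one token, so only those pairs need to be verified.

```python
from collections import defaultdict
from math import ceil

EPS = 1e-9  # assumed guard so that e.g. 0.7 * 10 does not round up to 8


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return sorted pairs (i, j), i < j, with Jaccard(records[i], records[j]) >= threshold.

    Empty records are skipped in this sketch.
    """
    # Global token order: rarest tokens first, ties broken by token value.
    freq = defaultdict(int)
    for rec in records:
        for tok in set(rec):
            freq[tok] += 1
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]
    sets = [set(toks) for toks in ordered]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for rid, toks in enumerate(ordered):
        if not toks:
            continue
        # Prefix length for Jaccard threshold t: |x| - ceil(t * |x|) + 1.
        plen = len(toks) - ceil(threshold * len(toks) - EPS) + 1
        prefix = toks[:plen]

        # Candidates are earlier records sharing at least one prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification: compute the exact similarity for candidates only.
        for cid in candidates:
            if jaccard(sets[cid], sets[rid]) >= threshold - EPS:
                results.append((cid, rid))

        for tok in prefix:
            index[tok].append(rid)
    return sorted(results)


records = [
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c", "d", "f"],
    ["x", "y", "z"],
    ["a", "b", "c", "d", "e"],
]
print(prefix_filter_join(records, 0.6))
print(prefix_filter_join(records, 0.9))
```

In this sketch the record `["x", "y", "z"]` never becomes a candidate for any other record, so its similarity is never computed. The filtering techniques proposed in the paper aim to shrink the candidate set further than this baseline does.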