Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Building a scalable and accurate copy detection mechanism
Proceedings of the first ACM international conference on Digital libraries
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Multiple-signal duplicate detection for search evaluation
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Principles of hash-based text retrieval
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Tracking multiple topics for finding interesting articles
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Scalable near identical image and shot detection
Proceedings of the 6th ACM international conference on Image and video retrieval
Combinatorial algorithms for web search engines: three success stories
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Tracking Web spam with HTML style similarities
ACM Transactions on the Web (TWEB)
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web
iRobot: an intelligent crawler for web forums
Proceedings of the 17th international conference on World Wide Web
Exploring traversal strategy for web forum crawling
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
De-duping URLs via rewrite rules
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Quantitative comparisons of search engine results
Journal of the American Society for Information Science and Technology
Trusting spam reporters: A reporter-based reputation system for email filtering
ACM Transactions on Information Systems (TOIS)
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints
Proceedings of the VLDB Endowment
Achieving both high precision and high recall in near-duplicate detection
Proceedings of the 17th ACM conference on Information and knowledge management
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Detecting the origin of text segments efficiently
Proceedings of the 18th international conference on World wide web
IRLbot: Scaling to 6 billion pages and beyond
ACM Transactions on the Web (TWEB)
Leveraging discarded samples for tighter estimation of multiple-set aggregates
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Applying syntactic similarity algorithms for enterprise information management
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Near-duplicate detection for web-forums
IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Automatic retrieval of similar content using search engine query interface
Proceedings of the 18th ACM conference on Information and knowledge management
URL normalization for de-duplication of web pages
Proceedings of the 18th ACM conference on Information and knowledge management
Proceedings of the VLDB Endowment
Exploiting Sentence-Level Features for Near-Duplicate Document Detection
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Learning URL patterns for webpage de-duplication
Proceedings of the third ACM international conference on Web search and data mining
Detecting visually similar Web pages: Application to phishing detection
ACM Transactions on Internet Technology (TOIT)
Fast approximate duplicate detection for 2D-NMR spectra
DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
A pattern tree-based approach to learning URL normalization rules
Proceedings of the 19th international conference on World wide web
Organizing news archives by near-duplicate copy detection in digital libraries
ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Detecting near-duplicates in large-scale short text databases
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Weighted shingling: an adaptation of shingling for weighted shingles
IIT'09 Proceedings of the 6th international conference on Innovations in information technology
Efficient parallel set-similarity joins using MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Proceedings of the 21st ACM conference on Hypertext and hypermedia
Self-taught hashing for fast similarity search
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Adaptive near-duplicate detection via similarity learning
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Measuring the interestingness of articles in a limited user environment
Information Processing and Management: an International Journal
Simple and efficient algorithm for approximate dictionary matching
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Large scale parallel document mining for machine translation
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
An efficient similarity join algorithm with cosine similarity predicate
DEXA'10 Proceedings of the 21st international conference on Database and expert systems applications: Part II
Hypergeometric language model and zipf-like scoring function for web document similarity retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Learning website hierarchies for keyword enrichment in contextual advertising
Proceedings of the fourth ACM international conference on Web search and data mining
Fixing the threshold for effective detection of near duplicate web documents in web crawling
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Detecting near-duplicate relations in user generated forum content
OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Exponential time improvement for min-wise based algorithms
Information and Computation
Language Resources and Evaluation
PRESIDIO: A Framework for Efficient Archival Data Storage
ACM Transactions on Storage (TOS)
ATLAS: a probabilistic algorithm for high dimensional similarity search
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient similarity joins for near-duplicate detection
ACM Transactions on Database Systems (TODS)
Hypergeometric language models for republished article finding
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Query by document via a decomposition-based two-level retrieval approach
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Fast locality-sensitive hashing
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
PG-join: proximity graph based string similarity joins
SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
Efficient duplicate detection on cloud using a new signature scheme
WAIM'11 Proceedings of the 12th international conference on Web-age information management
A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Partial duplicate detection for large book collections
Proceedings of the 20th ACM international conference on Information and knowledge management
Probabilistic near-duplicate detection using simhash
Proceedings of the 20th ACM international conference on Information and knowledge management
Measuring redundancy level on the web
AINTEC '11 Proceedings of the 7th Asian Internet Engineering Conference
Temporal shingling for version identification in web archives
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Exponential time improvement for min-wise based algorithms
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
Bayesian locality sensitive hashing for fast similarity search
Proceedings of the VLDB Endowment
Keeping Found Things Found: The Study and Practice of Personal Information Management
A fusion of algorithms in near duplicate document detection
PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
FoCUS: learning to crawl web forums
Proceedings of the 21st international conference companion on World Wide Web
Clustering and load balancing optimization for redundant content removal
Proceedings of the 21st international conference companion on World Wide Web
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors
Proceedings of the VLDB Endowment
Exploring temporal evidence in web information retrieval
FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
CRSI: a compact randomized similarity index for set-valued features
Proceedings of the 15th International Conference on Extending Database Technology
Multi-resolution similarity hashing
Digital Investigation: The International Journal of Digital Forensics & Incident Response
Index maintenance for time-travel text search
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Detecting quilted web pages at scale
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Efficient range queries over uncertain strings
SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
Learning to rank duplicate bug reports
Proceedings of the 21st ACM international conference on Information and knowledge management
Semi-supervised spectral hashing for fast similarity search
Neurocomputing
Rank hash similarity for fast similarity search
Information Processing and Management: an International Journal
Detecting near-duplicate documents using sentence-level features and supervised learning
Expert Systems with Applications: An International Journal
Reducing information redundancy in search results
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Scalable all-pairs similarity search in metric spaces
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Groundhog day: near-duplicate detection on Twitter
Proceedings of the 22nd international conference on World Wide Web
Bottom-k and priority sampling, set similarity and subset sums with minimal independence
Proceedings of the forty-fifth annual ACM symposium on Theory of computing
Near duplicate detection in an academic digital library
Proceedings of the 2013 ACM symposium on Document engineering
A pattern-based selective recrawling approach for object-level vertical search
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Streaming similarity search over one billion tweets using parallel locality-sensitive hashing
Proceedings of the VLDB Endowment
Efficient estimation for high similarities using odd sketches
Proceedings of the 23rd international conference on World wide web
Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
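To make the two families of techniques concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It assumes word-level tokens, a shingle length of 3, exact Jaccard similarity over shingle sets (rather than the sampled sketches a production system would use), and an unweighted 64-bit fingerprint in which the bits of a BLAKE2b hash stand in for the random +1/-1 projection entries. All function names and parameters here are hypothetical and chosen only for illustration.

```python
import hashlib
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"\w+", text.lower())


def shingles(text, k=3):
    """Set of all contiguous k-token sequences (shingles) in the text."""
    toks = tokens(text)
    if len(toks) < k:
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash64(token):
    """Deterministic 64-bit hash of a token (BLAKE2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text, bits=64):
    """Random-projection style fingerprint; assumes bits <= 64.

    Each token votes +1 or -1 on every bit position according to its hash;
    the fingerprint bit is 1 where the summed votes are positive.
    """
    counts = [0] * bits
    for tok in tokens(text):
        h = _hash64(tok)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

shingle_similarity = jaccard(shingles(doc_a), shingles(doc_b))
bit_distance = hamming(simhash(doc_a), simhash(doc_b))
```

In this sketch, two pages would be treated as near-duplicates when `shingle_similarity` is above a chosen threshold, or when `bit_distance` is below one; the thresholds themselves are tuning choices that the sketch does not fix.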