The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed-size "sketch" for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a "sample" of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest. The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.
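The two aspects described above can be illustrated with a minimal sketch. It assumes a standard min-hash estimator over word shingles, which is not necessarily the construction used in the talk, and it does not reproduce the sub-50-byte sample. The function names, the 4-word shingle length, the 100 seeded BLAKE2b hash functions, and the 0.9 threshold are all illustrative assumptions. With 100 stored 64-bit minima, this sketch is also larger than the few hundred bytes per document mentioned above.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text (w=4 is an assumption)."""
    words = text.lower().split()
    if len(words) < w:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Exact resemblance of two sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(item, seed):
    """64-bit hash of a string under a given seed (stand-in for a random permutation)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(items, k=100):
    """Fixed-size sketch: the minimum hash value under each of k seeded hashes.

    Depends only on the document's own shingle set, so sketches can be
    computed independently and stored per document.
    """
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(k)]


def estimate_resemblance(s1, s2):
    """Estimate resemblance as the fraction of positions where the minima agree."""
    if len(s1) != len(s2):
        raise ValueError("sketches must have the same length")
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


def is_near_duplicate(s1, s2, threshold=0.9):
    """Threshold test: is the estimated resemblance at least `threshold`?"""
    return estimate_resemblance(s1, s2) >= threshold
```

In use, a crawler would compute `sketch(shingles(page_text))` once for each new page and compare it with stored sketches using `is_near_duplicate`. The estimate is a random quantity, so a larger `k` narrows the error at the cost of a larger sketch.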