We study a generalized version of the near-duplicate detection problem: deciding whether one document is contained in (is a subset of) another. In text-based applications, document containment appears as exact duplicates, near-duplicates, or proper containments, where the first two are special cases of the third. We introduce a novel method, called CoDet, designed specifically for this problem, and compare its performance with four well-known near-duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) adapted to containment detection. Our method is extensible to other domains and is especially suitable for streaming news. Experimental results show that CoDet detects containments both effectively and efficiently.
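To make the distinction between containment and near-duplication concrete, the following minimal sketch computes a shingle-based containment score alongside a symmetric resemblance (Jaccard) score. It is not CoDet, whose details are not given here; it only illustrates the general notion of containment as the fraction of one document's word shingles that also occur in another. The function names, the whitespace tokenization, and the shingle size `k=3` are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a document shorter than k words is one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(a, b, k=3):
    """Fraction of a's shingles that also appear in b (asymmetric)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa:
        return 0.0
    return len(sa & sb) / len(sa)


def resemblance(a, b, k=3):
    """Jaccard similarity of the two shingle sets (symmetric)."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


short_doc = "the quick brown fox jumps"
long_doc = "yesterday the quick brown fox jumps over the lazy dog"

print(containment(short_doc, long_doc))  # short_doc is fully contained
print(containment(long_doc, short_doc))  # the reverse direction is partial
print(resemblance(short_doc, long_doc))  # symmetric score stays low
```

In this sketch an exact duplicate scores 1.0 in both containment directions, while a short document embedded in a longer one scores 1.0 in one direction only. This asymmetry is why a symmetric near-duplicate measure can miss containments.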