Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Min-wise independent permutations
Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Identifying and Filtering Near-Duplicate Documents
CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Improved robustness of signature-based near-replica detection via lexicon randomization
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Fighting Spam with Reputation Systems
Queue - Social Computing
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Lexicon randomization for near-duplicate detection with I-Match
The Journal of Supercomputing
Learning term-weighting functions for similarity measures
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
Detecting duplicate web documents using clickthrough data
Proceedings of the fourth ACM international conference on Web search and data mining
A supervised method of feature weighting for measuring semantic relatedness
Canadian AI'11 Proceedings of the 24th Canadian conference on Advances in artificial intelligence
Partial duplicate detection for large book collections
Proceedings of the 20th ACM international conference on Information and knowledge management
CoDet: sentence-based containment detection in news corpora
Proceedings of the 20th ACM international conference on Information and knowledge management
Detection of near-duplicate user generated contents: the SMS spam collection
Proceedings of the 3rd international workshop on Search and mining user-generated contents
Clustering and load balancing optimization for redundant content removal
Proceedings of the 21st international conference companion on World Wide Web
On generating large-scale ground truth datasets for the deduplication of bibliographic records
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Detecting near-duplicate documents using sentence-level features and supervised learning
Expert Systems with Applications: An International Journal
Optimizing parallel algorithms for all pairs similarity search
Proceedings of the sixth ACM international conference on Web search and data mining
Cache-conscious performance optimization for similarity search
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize a specified similarity function, such as cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash values that serve as document signatures, via a locality sensitive hashing scheme, for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, a desirable property that existing methods lack.
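The pipeline the abstract describes — a weighted k-gram vector per document, compressed by locality sensitive hashing into a short signature whose closeness tracks the chosen similarity — can be illustrated with a random-hyperplane-style SimHash sketch. This is a minimal sketch, not the paper's implementation: here the k-gram weights are an input (uniform by default), whereas the paper learns them, and the `kgrams`, `simhash`, and `hamming` helpers and the MD5-based feature hashing are illustrative choices.

```python
import hashlib
import re
from collections import Counter


def kgrams(text, k=3):
    """Sliding word k-grams of a lowercased, whitespace-tokenized document."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text, k=3, bits=64, weights=None):
    """64-bit SimHash signature of a weighted k-gram vector.

    `weights` maps a k-gram to a real-valued weight (uniform when absent);
    in the paper these weights are learned, here they are just a parameter.
    """
    weights = weights or {}
    counts = Counter(kgrams(text, k))
    acc = [0.0] * bits  # per-bit accumulator of signed weights
    for gram, freq in counts.items():
        w = freq * weights.get(gram, 1.0)
        # Hash each k-gram to a pseudo-random 64-bit pattern.
        h = int.from_bytes(hashlib.md5(gram.encode()).digest()[:8], "big")
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    # The sign of each accumulator gives one bit of the signature.
    return sum(1 << i for i, v in enumerate(acc) if v > 0)


def hamming(a, b):
    """Hamming distance between two signatures; small distance ~ high similarity."""
    return bin(a ^ b).count("1")
```

With signatures in hand, near-duplicate candidates are pairs whose Hamming distance falls below a threshold — a single word change perturbs only the few k-grams containing it, so the signature of a near-duplicate moves far less than that of an unrelated document.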