Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Identifying and Filtering Near-Duplicate Documents
COM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Efficient similarity search and classification via rank aggregation
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Online duplicate document detection: signature reliability in a dynamic retrieval environment
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Near-duplicate detection for eRulemaking
dg.o '05 Proceedings of the 2005 national conference on Digital government research
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Where and How Duplicates Occur in the Web
LA-WEB '06 Proceedings of the Fourth Latin American Web Congress
Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Spark: top-k keyword query in relational databases
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Multiple-signal duplicate detection for search evaluation
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Achieving both high precision and high recall in near-duplicate detection
Proceedings of the 17th ACM conference on Information and knowledge management
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Exploiting Sentence-Level Features for Near-Duplicate Document Detection
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Detecting near-duplicates in large-scale short text databases
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Adaptive near-duplicate detection via similarity learning
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient similarity joins for near-duplicate detection
ACM Transactions on Database Systems (TODS)
Semi-supervised SimHash for efficient document similarity search
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
A supervised machine learning approach for duplicate detection over gazetteer records
GeoS'11 Proceedings of the 4th international conference on GeoSpatial semantics
Probabilistic near-duplicate detection using simhash
Proceedings of the 20th ACM international conference on Information and knowledge management
Detection of near-duplicate user generated contents: the SMS spam collection
Proceedings of the 3rd international workshop on Search and mining user-generated contents
A Genetic Programming Approach to Record Deduplication
IEEE Transactions on Knowledge and Data Engineering
Efficient Exact Similarity Searches Using Multiple Token Orderings
ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Cutting Plane Training for Linear Support Vector Machines
IEEE Transactions on Knowledge and Data Engineering
On Similarity Preserving Feature Selection
IEEE Transactions on Knowledge and Data Engineering
SBBS: A sliding blocking algorithm with backtracking sub-blocks for duplicate data detection
Expert Systems with Applications: An International Journal
Hi-index | 12.05 |
We present a novel method for detecting near-duplicates from a large collection of documents. Three major parts are involved in our method, feature selection, similarity measure, and discriminant derivation. To find near-duplicates to an input document, each sentence of the input document is fetched and preprocessed, the weight of each term is calculated, and the heavily weighted terms are selected to be the feature of the sentence. As a result, the input document is turned into a set of such features. A similarity measure is then applied and the similarity degree between the input document and each document in the given collection is computed. A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set, which is then employed to determine whether a document is a near-duplicate to the input document based on the similarity degree between them. The sentence-level features we adopt can better reveal the characteristics of a document. Besides, learning the discriminant function by SVM can avoid trial-and-error efforts required in conventional methods. Experimental results show that our method is effective in near-duplicate document detection.