For near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. The reason is that near-duplicate documents have different characteristics from both the "almost-identical" documents of the data cleaning task and the "relevant" documents of the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
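To make the idea of instance-level constraints concrete, the following is a minimal sketch and not the paper's algorithm. It assumes word-set Jaccard similarity, a single-link merging rule, and an arbitrary threshold of 0.6. The function name `constrained_clusters` and the hand-supplied `must_link` and `cannot_link` pairs are hypothetical; in the approach described above, such constraints would presumably be derived from document attributes and content structure.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def constrained_clusters(docs, must_link=(), cannot_link=(), threshold=0.6):
    """Single-link clustering that honours pairwise instance-level constraints.

    must_link / cannot_link are iterables of (i, j) document-index pairs.
    The threshold value is an illustrative assumption.
    """
    tokens = [set(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def violates(ri, rj):
        # Merging clusters ri and rj is forbidden if any cannot-link pair
        # has one member in each of them.
        return any({find(a), find(b)} == {ri, rj} for a, b in cannot_link)

    # Must-link pairs are merged unconditionally before any similarity test.
    for a, b in must_link:
        parent[find(a)] = find(b)

    # Consider candidate pairs from most to least similar.
    pairs = sorted(
        ((jaccard(tokens[i], tokens[j]), i, j)
         for i, j in combinations(range(len(docs)), 2)),
        reverse=True,
    )
    for sim, i, j in pairs:
        if sim < threshold:
            break
        ri, rj = find(i), find(j)
        if ri != rj and not violates(ri, rj):
            parent[ri] = rj

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


docs = [
    "please protect the wetlands from new development",
    "please protect the wetlands from any new development",
    "i oppose the proposed rule on emissions",
    "please protect the wetlands from new development now",
]
print(constrained_clusters(docs))
print(constrained_clusters(docs, cannot_link=[(0, 3)]))
```

Without constraints, the three wetlands comments fall into one cluster. Adding a cannot-link between documents 0 and 3 keeps document 3 apart even though its text is nearly identical, which is the kind of decision that text similarity alone cannot make.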