In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former actually makes unauthorized use of documents difficult or impossible while the latter makes it easier to discover such activity.

In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, describe implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.
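To make the register-then-detect idea concrete, here is a minimal sketch in Python. It is not the COPS implementation, which the abstract does not detail. It assumes, purely for illustration, that documents are split into sentence-like chunks, that each chunk is hashed with SHA-256, and that a document is flagged when the fraction of its chunks found in a registered document reaches a threshold. The names `Registry`, `register`, `detect`, `chunks`, and the default `threshold=0.5` are all hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def chunks(text):
    """Split text into normalized sentence-like chunks (an assumed chunking unit)."""
    parts = re.split(r"[.!?]+", text)
    # Lowercase and collapse whitespace so trivial edits do not change the hash.
    normalized = (" ".join(p.lower().split()) for p in parts)
    return [c for c in normalized if c]


def chunk_hash(chunk):
    """Hash one chunk; SHA-256 is an illustrative choice, not taken from the paper."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


class Registry:
    """Hypothetical registration server: maps chunk hashes to registered documents."""

    def __init__(self):
        self._index = defaultdict(set)

    def register(self, doc_id, text):
        """Record every chunk hash of a document under its identifier."""
        for c in chunks(text):
            self._index[chunk_hash(c)].add(doc_id)

    def detect(self, text, threshold=0.5):
        """Return {doc_id: overlap} for registered documents whose overlap with
        `text` (fraction of its distinct chunks that match) is >= threshold."""
        hashes = {chunk_hash(c) for c in chunks(text)}
        if not hashes:
            return {}
        counts = defaultdict(int)
        for h in hashes:
            for doc_id in self._index.get(h, ()):
                counts[doc_id] += 1
        overlap = {d: n / len(hashes) for d, n in counts.items()}
        return {d: f for d, f in overlap.items() if f >= threshold}
```

In this sketch a complete copy yields an overlap of 1.0, while a partial copy yields a smaller fraction, so the threshold trades missed partial copies against false alarms. That is the kind of parameter setting the abstract says the experiments are meant to inform.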