In recent years the Web has made a vast amount of useful information available. However, the same web technologies that let sources share their information have also made it easy for sources to copy from each other and, often, to publish without proper attribution. Understanding the copying relationships between sources has many benefits, including helping data providers protect their own rights, improving various aspects of data integration, and facilitating in-depth analysis of information flow. The importance of copy detection has led to a substantial amount of research across many disciplines of Computer Science, each focused on a particular type of information: text, images, videos, software code, or structured data. This tutorial explores the similarities and differences between the techniques proposed for copy detection across these types of information. We also examine the computational challenges associated with large-scale copy detection, indicating how copying could be detected efficiently, and identify a range of open problems for the community.