Large-scale copy detection

Authors:
Xin Luna Dong;Divesh Srivastava
Affiliations:
AT&T Labs-Research, Florham Park, NJ, USA;AT&T Labs-Research, Florham Park, NJ, USA
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011


Abstract

The Web has made a vast amount of useful information available in recent years. However, the web technologies that enable sources to share their information have also made it easy for sources to copy from each other, often publishing without proper attribution. Understanding the copying relationships between sources has many benefits, including helping data providers protect their own rights, improving various aspects of data integration, and facilitating in-depth analysis of information flow. The importance of copy detection has led to a substantial amount of research in many disciplines of Computer Science, each focusing on a particular type of information, such as text, images, videos, software code, and structured data. This tutorial explores the similarities and differences between the techniques proposed for copy detection across these types of information. We also examine the computational challenges associated with large-scale copy detection, indicating how copying could be detected efficiently, and identify a range of open problems for the community.