Efficient plagiarism detection for large code repositories

Authors:
Steven Burrows;S. M. M. Tahaghoghi;Justin Zobel
Affiliations:
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia;School of Computer Science and Information Technology, RMIT University, Melbourne, Australia;School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Venue:
Software—Practice & Experience
Year:
2007

Abstract

Unauthorized re-use of code by students is a widespread problem in academic institutions, and it raises liability issues for industry. Manual plagiarism detection is time-consuming, and current effective plagiarism detection approaches do not scale easily to very large code repositories. Practical text-based plagiarism detection systems can handle large collections, but the same is not true of code-based plagiarism detection. In this paper, we propose techniques for detecting plagiarism in program code using text similarity measures and local alignment. Through detailed empirical evaluation on small and large collections of programs, we show that our approach is highly scalable while maintaining levels of effectiveness similar to those of the popular JPlag and MOSS systems. Copyright © 2006 John Wiley & Sons, Ltd.
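
To give a rough idea of what local alignment over program code can look like, the following is a minimal illustrative sketch in Python, not the authors' implementation. It assumes a crude tokenizer (the hypothetical `tokenize` function and its small `KEYWORDS` set) that collapses identifiers and numbers into generic tokens, and it scores two token sequences with a standard Smith-Waterman recurrence. The scoring parameters `match`, `mismatch`, and `gap` are arbitrary illustrative values.

```python
import re

# Assumption: a tiny keyword set for illustration; a real tokenizer would
# cover the full grammar of the target language.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}


def tokenize(code):
    """Map source text to a coarse token stream (hypothetical helper).

    Identifiers become "ID" and integer literals become "NUM", so that
    renaming variables or changing constants does not hide similarity.
    """
    tokens = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero, which is what makes the
            # alignment local rather than global.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    original = tokenize("int total = 0;")
    renamed = tokenize("int sum = 5;")
    print(local_alignment_score(original, renamed))
```

Because the recurrence compares every pair of positions, its cost grows with the product of the two sequence lengths. In a large collection it would therefore be natural to apply it only to candidate pairs that a cheaper similarity measure has already ranked highly, though that pipeline ordering is an assumption here rather than something stated in the abstract.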