In this research, we investigate the efficient detection of similar academic papers. Given a specific paper and a corpus of academic papers, most of the papers in the corpus are first filtered out by a fast filtering method. Then, 47 methods (baseline methods and combinations of them) are applied to detect similar papers, where 34 of the methods are variants of new methods. These 34 methods fall into three new method sets: rare-word methods, combinations of at least two methods, and methods that compare portions of the papers. Some of the 34 heuristic methods achieve better results than previous heuristic methods, when compared with the results of the "Full Fingerprint" (FF) method, an expensive method that served as an expert. Moreover, the run time of the new methods is far lower than that of the FF method. The most interesting finding is a method called CWA(1), which computes the frequency of rare words that appear only once in each of the two compared papers. This method proved to be an efficient measure of whether two papers are similar.
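The CWA(1) idea described above can be sketched as follows: collect the words that occur exactly once in each of the two compared papers, and measure how many of these once-only words the papers share. This is only an illustrative sketch; the exact weighting and normalization used by the paper's CWA(1) method are not given in the abstract, so the normalization by the smaller once-word set below is an assumption.

```python
from collections import Counter

def cwa1_similarity(doc_a: str, doc_b: str) -> float:
    """Sketch of a CWA(1)-style measure.

    Counts the words that appear exactly once in doc_a AND exactly
    once in doc_b, then normalizes by the smaller once-word set.
    The normalization is an assumption, not the paper's formula.
    """
    # Words occurring exactly once in each document
    once_a = {w for w, c in Counter(doc_a.lower().split()).items() if c == 1}
    once_b = {w for w, c in Counter(doc_b.lower().split()).items() if c == 1}
    shared = once_a & once_b
    denom = min(len(once_a), len(once_b)) or 1  # guard against empty sets
    return len(shared) / denom
```

The intuition is that words rare enough to occur only once in a document are highly discriminative, so a large overlap of such words between two papers is a strong signal of similarity, while being much cheaper to compute than a full fingerprint comparison.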