Using structural information and citation evidence to detect significant plagiarism cases in scientific publications

  • Authors:
  • Salha Alzahrani; Vasile Palade; Naomie Salim; Ajith Abraham

  • Affiliations:
  • Department of Computer Science, Taif University, Taif, Saudi Arabia; Department of Computer Science, University of Oxford, Oxford, UK; Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Johor Bahru, Johor, Malaysia; VSB Technical University of Ostrava, Czech Republic

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2012


Abstract

In plagiarism detection (PD) systems, two important problems should be considered: the problem of retrieving candidate documents that are globally similar to a document q under investigation, and the problem of side-by-side comparison of q and its candidates to pinpoint plagiarized fragments in detail. In this article, the authors investigate the use of structural information of scientific publications in both problems, and the consideration of citation evidence in the second problem. Three statistical measures, namely Inverse Generic Class Frequency, Spread, and Depth, are introduced to assign a degree of importance (i.e., weight) to structural components in scientific articles. A term-weighting scheme is adjusted to incorporate component-weight factors, which is used to improve the retrieval of potential sources of plagiarism. A plagiarism screening process is applied based on a measure of resemblance, in which component-weight factors are exploited to ignore less significant or nonsignificant plagiarism cases. Using the notion of citation evidence, parts with proper citation evidence are excluded, and the remaining cases are flagged as suspicious and used to calculate the similarity index. The authors compare their approach to two flat-based baselines: TF-IDF weighting with a Cosine coefficient, and shingling with a Jaccard coefficient. In both baselines, they use different comparison units with overlapping measures for plagiarism screening. They conducted extensive experiments using a dataset of 15,412 documents divided into 8,657 source publications and 6,755 suspicious queries, which included 18,147 automatically inserted plagiarism cases. Component-weight factors are assessed using precision, recall, and F-measure averaged over a 10-fold cross-validation and compared using the ANOVA statistical test.
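The component-weighted retrieval idea described above can be sketched roughly as follows. This is a minimal illustration only: the component names, weight values, and the plain word-level tokenizer are assumptions for the example, not the paper's exact scheme (which derives weights from the Inverse Generic Class Frequency, Spread, and Depth measures).

```python
import math
from collections import Counter

def weighted_tfidf(docs, component_weights):
    """Build sparse TF-IDF vectors where each term occurrence is scaled by
    the weight of the structural component (e.g., title vs. body) it came
    from. `docs` is a list of {component_name: text} dicts; the weights
    here are illustrative placeholders."""
    n = len(docs)
    df = Counter()  # document frequency per term
    tfs = []
    for doc in docs:
        tf = Counter()
        for comp, text in doc.items():
            w = component_weights.get(comp, 1.0)
            for term in text.lower().split():
                tf[term] += w  # component-weighted term frequency
        tfs.append(tf)
        df.update(tf.keys())
    return [{t: f * math.log(n / df[t]) for t, f in tf.items()} for tf in tfs]

def cosine(u, v):
    """Cosine coefficient between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In this sketch, terms occurring in highly weighted components (such as titles) contribute more to the vector, so a query document retrieves candidates that share its structurally important vocabulary rather than just its overall word distribution.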
Results from structural-based candidate retrieval and plagiarism detection are evaluated statistically against the flat baselines using paired t-tests on 10-fold cross-validation runs, which demonstrate the efficacy of the proposed framework. An empirical study of the system's response shows that, unlike existing plagiarism detectors, the use of structural information helps to flag significant plagiarism cases, improve the similarity index, and provide human-like plagiarism screening results. © 2012 Wiley Periodicals, Inc.
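The screening step, shingling with a Jaccard coefficient scaled by a component-weight factor so that matches in low-importance components fall below the reporting threshold, can be sketched as below. The shingle size, weight values, and threshold are illustrative assumptions, not the paper's calibrated parameters.

```python
def shingles(text, k=3):
    """k-word shingles (contiguous word n-grams) of a text fragment."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient between two shingle sets (the flat baseline)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def screen_fragment(query_frag, source_frag, component_weight, threshold=0.5):
    """Flag a fragment pair as a significant case only when the shingle
    overlap, scaled by the weight of the structural component it occurs
    in, exceeds the threshold. Weight and threshold values here are
    hypothetical."""
    score = component_weight * jaccard(shingles(query_frag), shingles(source_frag))
    return score, score >= threshold
```

Under this scheme, an identical fragment found in a low-weight component (e.g., a references section) scores below the threshold and is ignored, which mirrors the abstract's point that component-weight factors suppress nonsignificant cases before citation evidence is checked.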