On the mono- and cross-language detection of text reuse and plagiarism

  • Authors:
  • Alberto Barrón-Cedeño

  • Affiliations:
  • Universidad Politécnica de Valencia, Valencia, Spain

  • Venue:
  • Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2010

Abstract

Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the large amount of text readily available. For instance, recent studies claim that a high proportion of student reports now include plagiarism, making manual plagiarism detection practically infeasible. Automatic plagiarism detection tools assist experts in analysing documents for plagiarism. Nevertheless, the lack of standard collections with cases of plagiarism has prevented the accurate comparison of models, making their differences hard to appreciate. Seminal efforts on the detection of text reuse [2] have fostered the composition of standard resources for the accurate evaluation and comparison of methods. The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: (i) the adequate identification of good potential sources for a given suspicious text; (ii) the detection of plagiarism despite modifications, such as word substitution and paraphrasing (special stress is given to cross-language plagiarism); and (iii) the generation of standard collections of cases of plagiarism and text reuse in order to provide a framework for the accurate comparison of models.

Regarding problems (i) and (ii), we have carried out preliminary experiments over the METER corpus [2]. Given a suspicious document d_q and a collection of potential source documents D, the process is divided into two steps. First, a small subset D* of potential source documents is retrieved from D. The documents d in D* are the most related to d_q and, therefore, the most likely to contain the sources of the fragments plagiarised in it. We perform this stage on the basis of the Kullback-Leibler distance, computed over a subsample of the documents' vocabularies. Afterwards, a detailed analysis compares d_q to every d in D* in order to identify potential cases of plagiarism and their sources. This comparison is made on the basis of word n-grams, with n = {2, 3}; these n-gram levels are flexible enough to retrieve plagiarised fragments and their sources despite modifications [1]. The result is offered to the user, who makes the final decision. Further experiments were carried out at both stages to compare other similarity measures, such as the cosine measure, the Jaccard coefficient, and diverse fingerprinting and probabilistic models.

One of the main weaknesses of currently available models is that they are unable to detect cross-language plagiarism. Detecting this kind of plagiarism is highly relevant, as most of the published information is written in English, and authors writing in other languages may find it attractive to make use of direct translations. Our experiments, carried out over parallel and comparable corpora, show that "standard" cross-language information retrieval models are not enough. In fact, if the source and target languages are related in some way (through common linguistic ancestors or shared technical vocabulary), a simple comparison based on character n-grams appears to suffice. However, in those cases where the relation between the languages involved is weaker, other models, such as those based on statistical machine translation, are necessary [3]. We plan to perform further experiments, mainly to approach the detection of cross-language plagiarism. To do so, we will use the corpora developed under the framework of the PAN competition on plagiarism detection (cf. PAN@CLEF: http://pan.webis.de). Models that consider cross-language thesauri and comparison of cognates will also be applied.
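
To make the heuristic retrieval step concrete, the following is a minimal sketch of ranking candidate source documents by Kullback-Leibler distance over smoothed unigram term distributions. It is not the thesis implementation: the tokenisation, the restriction of the vocabulary to the terms of d_q (standing in for the vocabulary subsampling mentioned in the abstract), and the smoothing constant are all illustrative assumptions.

    import math
    from collections import Counter

    def term_distribution(tokens, vocabulary, epsilon=1e-9):
        """Smoothed unigram probabilities over a fixed vocabulary."""
        counts = Counter(tokens)
        total = sum(counts[t] for t in vocabulary) + epsilon * len(vocabulary)
        return {t: (counts[t] + epsilon) / total for t in vocabulary}

    def kl_distance(p, q):
        """D(p || q); assumes p and q are defined over the same support."""
        return sum(p[t] * math.log(p[t] / q[t]) for t in p)

    def retrieve_candidates(dq_tokens, collection, k=10):
        """Rank documents in `collection` (id -> tokens) by KL distance to d_q."""
        vocabulary = set(dq_tokens)  # illustrative stand-in for vocabulary subsampling
        p = term_distribution(dq_tokens, vocabulary)
        scored = sorted(
            (kl_distance(p, term_distribution(tokens, vocabulary)), doc_id)
            for doc_id, tokens in collection.items()
        )
        return [doc_id for _, doc_id in scored[:k]]

The point of this stage is cost: a cheap distributional comparison prunes D down to a small D* before the more expensive n-gram analysis is applied.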
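For the detailed-analysis step, a common way to exploit word n-grams with n = {2, 3} is a containment measure: the fraction of the suspicious document's n-grams that also occur in a candidate source. The sketch below works at document level for brevity (the thesis identifies the plagiarised fragments themselves), and the 0.5 threshold is an illustrative assumption, not a tuned value.

    def word_ngrams(tokens, n):
        """Set of word n-grams in a token sequence."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def containment(dq_tokens, d_tokens, n):
        """Fraction of d_q's word n-grams that also occur in d."""
        dq_grams = word_ngrams(dq_tokens, n)
        if not dq_grams:
            return 0.0
        return len(dq_grams & word_ngrams(d_tokens, n)) / len(dq_grams)

    def flag_candidates(dq_tokens, candidates, threshold=0.5):
        """Candidates whose bigram or trigram containment exceeds the threshold."""
        return [doc_id for doc_id, d_tokens in candidates.items()
                if any(containment(dq_tokens, d_tokens, n) >= threshold
                       for n in (2, 3))]

Short n-grams tolerate word substitution and light paraphrasing better than long exact matches, which is why n = {2, 3} is used rather than longer sequences.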
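Finally, the character n-gram comparison suggested for related language pairs can be sketched as cosine similarity over character 3-gram profiles; no translation is needed because related languages share cognate character sequences. The choice of n = 3 and the toy example are assumptions for illustration.

    import math
    from collections import Counter

    def char_ngram_profile(text, n=3):
        """Frequency profile of character n-grams in lowercased text."""
        text = text.lower()
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def cosine(p, q):
        """Cosine similarity between two n-gram frequency profiles."""
        shared = set(p) & set(q)
        num = sum(p[g] * q[g] for g in shared)
        den = math.sqrt(sum(v * v for v in p.values())) * \
              math.sqrt(sum(v * v for v in q.values()))
        return num / den if den else 0.0

    # A Spanish source and its English counterpart still share many
    # character 3-grams ("det", "pla", "lag", ...), so the score is nonzero.
    es = char_ngram_profile("detección de plagio translingüe")
    en = char_ngram_profile("cross-language plagiarism detection")
    print(cosine(es, en))

For weakly related language pairs this signal fades, which is why the abstract points to statistical machine translation based models instead [3].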