Cross-language plagiarism detection

Authors:
Martin Potthast;Alberto Barrón-Cedeño;Benno Stein;Paolo Rosso
Affiliations:
Web Technology and Information Systems (Webis), Bauhaus-Universität Weimar, Weimar, Germany;Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Valencia, Spain;Web Technology and Information Systems (Webis), Bauhaus-Universität Weimar, Weimar, Germany;Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Valencia, Spain
Venue:
Language Resources and Evaluation
Year:
2011

Abstract

Cross-language plagiarism detection deals with the automatic identification and extraction of plagiarism in a multilingual setting. In this setting, a suspicious document is given, and the task is to retrieve all sections of the document that originate from a large, multilingual document collection. Our contributions in this field are as follows: (1) a comprehensive retrieval process for cross-language plagiarism detection is introduced, highlighting the differences from monolingual plagiarism detection; (2) state-of-the-art solutions for two important subtasks are reviewed; (3) retrieval models for the assessment of cross-language similarity are surveyed; and (4) the three models CL-CNG, CL-ESA, and CL-ASA are compared. Our evaluation is of realistic scale: it relies on 120,000 test documents selected from the corpora JRC-Acquis and Wikipedia, so that for each test document highly similar documents are available in all six languages: English, German, Spanish, French, Dutch, and Polish. The models are employed in a series of ranking tasks, and more than 100 million similarities are computed with each model. The results of our evaluation indicate that CL-CNG, despite its simple approach, is the best choice for ranking and comparing texts across languages if the languages are syntactically related. CL-ESA almost matches the performance of CL-CNG, and does so on arbitrary pairs of languages. CL-ASA works best on "exact" translations but does not generalize well.
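
To make the "simple approach" of CL-CNG more concrete, the following is a minimal sketch of a character n-gram similarity used in a ranking task. It is not the authors' implementation: the function names (char_ngrams, cosine, cl_cng_similarity, rank_candidates), the choice of n = 3, the text normalization, and the use of raw n-gram counts as weights are all assumptions made for illustration, and the configuration evaluated in the paper may differ.

```python
import math
import re
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a normalized text.

    Assumption: the text is lowercased and every run of characters that is
    not a letter or digit is collapsed into a single space.
    """
    normalized = re.sub(r"[\W_]+", " ", text.lower()).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cl_cng_similarity(doc_a: str, doc_b: str, n: int = 3) -> float:
    """Compare two texts, possibly in different languages, by shared n-grams."""
    return cosine(char_ngrams(doc_a, n), char_ngrams(doc_b, n))


def rank_candidates(query: str, candidates: list[str], n: int = 3) -> list[tuple[float, str]]:
    """Return (score, candidate) pairs, most similar candidate first."""
    query_vec = char_ngrams(query, n)
    scored = [(cosine(query_vec, char_ngrams(c, n)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    english = "international cooperation agreement"
    pool = [
        "zzz qqq xxx",                           # shares no 3-grams with the query
        "acuerdo de cooperación internacional",  # Spanish translation
    ]
    for score, text in rank_candidates(english, pool):
        print(f"{score:.3f}  {text}")
```

The sketch relies only on surface overlap between character sequences, which is why a model of this kind can be expected to work for language pairs that share vocabulary and spelling patterns and to degrade for pairs that do not.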