METER: MEasuring TExt Reuse

Authors:
Paul Clough;Robert Gaizauskas;Scott S. L. Piao;Yorick Wilks
Affiliations:
University of Sheffield, Sheffield, England;University of Sheffield, Sheffield, England;University of Sheffield, Sheffield, England;University of Sheffield, Sheffield, England
Venue:
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Year:
2002


Abstract

In this paper we present results from the METER (MEasuring TExt Reuse) project, whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not aware of any investigation that applies existing computational methods to this particular task. We investigate the classification of newspaper articles according to their degree of dependence upon, or derivation from, a newswire source, using a simple 3-level scheme designed by journalists. Three approaches to measuring text similarity are considered: n-gram overlap, Greedy String Tiling, and sentence alignment. Measured against a manually annotated corpus of source and derived news text, a combined classifier with automatically selected features performs best overall on the ternary classification, achieving an average F1-measure of 0.664 across the three categories.
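
To make two of the similarity measures named in the abstract concrete, the following is a minimal Python sketch of word n-gram overlap (computed here as a containment score) and a simplified Greedy String Tiling. It is not the METER project's implementation: the function names, the whitespace tokenisation, the `min_match` threshold, and the example sentences are all assumptions chosen for illustration, and the tiling routine is a plain quadratic version without the hashing speed-ups used in practical systems. Sentence alignment is omitted.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(derived: List[str], source: List[str], n: int = 1) -> float:
    """Share of the derived text's distinct n-grams that also occur in the source."""
    derived_grams = ngrams(derived, n)
    if not derived_grams:
        return 0.0
    return len(derived_grams & ngrams(source, n)) / len(derived_grams)


def greedy_string_tiling(a: List[str], b: List[str],
                         min_match: int = 2) -> List[Tuple[int, int, int]]:
    """Simplified Greedy String Tiling (illustrative, assumed variant).

    Repeatedly finds the longest common runs of unmarked tokens and marks
    them as tiles, until no run of at least `min_match` tokens remains.
    Returns tiles as (start_in_a, start_in_b, length).
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles: List[Tuple[int, int, int]] = []
    while True:
        max_len = min_match
        matches: List[Tuple[int, int, int]] = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # Extend the match while tokens agree and are still unmarked.
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(i, j, k)], k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        for i, j, k in matches:
            # Skip a match that overlaps a tile marked earlier in this pass.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            marked_a[i:i + k] = [True] * k
            marked_b[j:j + k] = [True] * k
            tiles.append((i, j, k))
    return tiles


def tile_coverage(derived: List[str], source: List[str], min_match: int = 2) -> float:
    """Fraction of derived tokens covered by tiles shared with the source."""
    if not derived:
        return 0.0
    tiles = greedy_string_tiling(derived, source, min_match)
    return sum(length for _, _, length in tiles) / len(derived)


# Invented example texts, not taken from the METER corpus.
source = "the prime minister said on tuesday that taxes would fall".split()
derived = "taxes would fall the prime minister said yesterday".split()

print(containment(derived, source, n=1))
print(containment(derived, source, n=2))
print(greedy_string_tiling(derived, source))
print(tile_coverage(derived, source))
```

In a setting like the one the abstract describes, scores of this kind could serve as features for a classifier that assigns each newspaper article to one of the three derivation levels. Unlike n-gram overlap, tiling still credits reused passages whose order has been rearranged, as in the example, where the two reused spans appear in swapped order.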