Intrinsic plagiarism analysis

Authors:
Benno Stein;Nedim Lipka;Peter Prettenhofer
Affiliations:
Faculty of Media, Media Systems, Bauhaus-Universität Weimar, Weimar, Germany 99421;Faculty of Media, Media Systems, Bauhaus-Universität Weimar, Weimar, Germany 99421;Faculty of Media, Media Systems, Bauhaus-Universität Weimar, Weimar, Germany 99421
Venue:
Language Resources and Evaluation
Year:
2011

Citing 35
Cited 8

Artificial intelligence: a modern approach

Artificial intelligence: a modern approach
Introduction to knowledge systems

Introduction to knowledge systems
Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Two algorithms for nearest-neighbor search in high dimensions

STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Signature extraction for overlap detection in documents

ACSC '02 Proceedings of the twenty-fifth Australasian conference on Computer science - Volume 4
Constructing Boosting Algorithms from SVMs: An Application to One-Class Classification

IEEE Transactions on Pattern Analysis and Machine Intelligence
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Combining One-Class Classifiers

MCS '01 Proceedings of the Second International Workshop on Multiple Classifier Systems
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Topic segmentation: algorithms and applications

Topic segmentation: algorithms and applications
One-class SVMs for document classification

The Journal of Machine Learning Research
Style mining of electronic messages for multiple authorship discrimination: first results

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Advances in domain independent linear text segmentation

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Authorship verification as a one-class classification problem

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Segmenting documents by stylistic character

Natural Language Engineering
A framework for authorship identification of online messages: Writing-style features and classification techniques

Journal of the American Society for Information Science and Technology
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Authorship attribution with thousands of candidate authors

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
On Authorship Attribution via Markov Chains and Sequence Kernels

ICPR '06 Proceedings of the 18th International Conference on Pattern Recognition - Volume 03
Author verification by linguistic profiling: An exploration of the parameter space

ACM Transactions on Speech and Language Processing (TSLP)
Linguistic profiling for author recognition and verification

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Obfuscating document stylometry to preserve author anonymity

COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Principles of hash-based text retrieval

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Author Identification Using Imbalanced and Limited Training Texts

DEXA '07 Proceedings of the 18th International Conference on Database and Expert Systems Applications
Measuring Differentiability: Unmasking Pseudonymous Authors

The Journal of Machine Learning Research
Authorship attribution

Foundations and Trends in Information Retrieval
Meta Analysis within Authorship Verification

DEXA '08 Proceedings of the 2008 19th International Conference on Database and Expert Systems Application
Computational methods in authorship attribution

Journal of the American Society for Information Science and Technology
A survey of modern authorship attribution methods

Journal of the American Society for Information Science and Technology
Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation

EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research
Indexing shared content in information retrieval systems

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Intrinsic plagiarism detection

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval

Comparative evaluation of text- and citation-based plagiarism detection approaches using GuttenPlag

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence

Proceedings of the 11th ACM symposium on Document engineering
Detection of text quality flaws as a one-class classification problem

Proceedings of the 20th ACM international conference on Information and knowledge management
Detection of near-duplicate user generated contents: the SMS spam collection

Proceedings of the 3rd international workshop on Search and mining user-generated contents
Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style

Expert Systems with Applications: An International Journal
Explanation in computational stylometry

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
Detecting machine-morphed malware variants via engine attribution

Journal in Computer Virology
Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection

Computational Linguistics

Abstract

Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world in which a reference collection is given. This article investigates whether plagiarism can be detected by a computer program when no reference can be provided, e.g., when the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed "unmasking", can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
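The first three stages of the analysis chain named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the fixed chunk size, the ten-word function-word list, and the z-score threshold are all assumptions made for illustration. In particular, a simple per-feature z-score rule stands in for a real one-class classifier, and the meta learning ("unmasking") stage is omitted entirely.

```python
import re
import statistics

# Assumption: a tiny, hand-picked function-word list used as style markers.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def chunk_words(text, size=50):
    """Document chunking: split the text into consecutive chunks of `size` words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def style_vector(words):
    """Toy style model: average word length plus relative function-word frequencies."""
    n = len(words) or 1
    avg_len = sum(len(w) for w in words) / n
    return [avg_len] + [words.count(fw) / n for fw in FUNCTION_WORDS]


def outlier_chunks(text, size=50, threshold=2.0):
    """Stand-in for one-class classification: flag chunks whose style deviates
    strongly from the document-wide average (largest per-feature z-score)."""
    vectors = [style_vector(chunk) for chunk in chunk_words(text, size)]
    columns = list(zip(*vectors))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    flagged = []
    for index, vector in enumerate(vectors):
        score = max(
            (abs(v - m) / s if s > 0 else 0.0)
            for v, m, s in zip(vector, means, stdevs)
        )
        if score > threshold:
            flagged.append(index)
    return flagged
```

Calling `outlier_chunks(document_text)` returns the indices of chunks whose style vector differs markedly from the rest of the same document. No reference collection is consulted, which is the defining property of the intrinsic setting. The toy features and threshold would need to be replaced by a proper style model and classifier in any serious use.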