A repetition based measure for verification of text collections and for text categorization

Authors:
Dmitry V. Khmelev;William J. Teahan
Affiliations:
Moscow State University;University of Wales, Bangor, Gwynedd LL57 1UT, Wales, UK
Venue:
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2003

Abstract

We propose a method for locating duplicates and plagiarism in a text collection using the R-measure, which is the normalized sum of the lengths of all suffixes of a text that are repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. The computation procedure can also be extended to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. A reformulation of the method leads to an algorithm for supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization and, interestingly, that its results correlate strongly with those of compression-based methods.
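
To make the definition concrete, the following is a minimal sketch of one possible reading of the R-measure, not the authors' implementation. It rests on assumptions that the abstract does not spell out: for each suffix of the text, the contribution is taken to be the length of its longest prefix that occurs in some other document, and the sum is normalized by l(l+1)/2 (the sum of all suffix lengths of a text of length l) so that the value lies in [0, 1]. It uses naive substring search rather than a suffix array, so it is only suitable for tiny inputs. The function names and the `categorize` helper, which picks the class whose training documents give the highest R-measure, are hypothetical.

```python
def longest_repeated_prefix(suffix: str, others: list[str]) -> int:
    """Length of the longest prefix of `suffix` that occurs in any of `others`."""
    best = 0
    for doc in others:
        # If a prefix of length k+1 occurs in doc, every shorter prefix does too,
        # so it is enough to try to extend the best length found so far.
        k = best
        while k < len(suffix) and suffix[:k + 1] in doc:
            k += 1
        best = k
    return best


def r_measure(text: str, others: list[str]) -> float:
    """Assumed R-measure: normalized sum of repeated-prefix lengths over all suffixes."""
    n = len(text)
    if n == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], others) for i in range(n))
    # Normalization is an assumption: n(n+1)/2 is the largest possible total.
    return 2.0 * total / (n * (n + 1))


def categorize(text: str, classes: dict[str, list[str]]) -> str:
    """Hypothetical classifier: choose the class with the highest R-measure."""
    return max(classes, key=lambda name: r_measure(text, classes[name]))


if __name__ == "__main__":
    print(r_measure("abc", ["abc"]))   # fully repeated text
    print(r_measure("abc", ["xyz"]))   # nothing repeated
    print(categorize("the cat sat", {
        "animals": ["the cat sat on the mat"],
        "finance": ["stocks fell sharply"],
    }))
```

Under these assumptions, a value close to 1 means that almost all of the text is repeated elsewhere in the collection, which is what would flag a document as a likely duplicate. A suffix array built over the other documents would replace the inner substring search with a binary search.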