We propose a method for locating duplicates and plagiarism in a text collection using the R-measure, which is the normalized sum of the lengths of all suffixes of a text that are repeated in other documents of the collection. The R-measure can be computed efficiently using the suffix array data structure, and the computation can be extended to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contain a significant number of duplicate and plagiarised documents. A further reformulation of the method yields an algorithm for supervised multi-class categorization, which we illustrate on the recently released Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization and, interestingly, that its results correlate strongly with those of compression-based methods.
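To make the definition concrete, the following is a minimal sketch of one way the R-measure might be computed. Several points are assumptions rather than details given in the abstract: each suffix is scored by the length of its longest prefix that also occurs in some other document, the sum is normalized by l(l+1)/2 (its maximum for a text of length l) so that the value lies in [0, 1], and the suffix array is built naively as a sorted list of suffixes instead of with an efficient construction. The names `r_measure`, `longest_match` and `classify` are hypothetical.

```python
from bisect import bisect_left


def sorted_suffixes(doc):
    # Naive stand-in for a suffix array: every suffix, in lexicographic order.
    # Quadratic in memory; acceptable only for a small illustration.
    return sorted(doc[i:] for i in range(len(doc)))


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_match(query, suffixes):
    # The suffix sharing the longest prefix with `query` is adjacent to the
    # position where `query` would be inserted in the sorted suffix list.
    pos = bisect_left(suffixes, query)
    best = 0
    for j in (pos - 1, pos):
        if 0 <= j < len(suffixes):
            best = max(best, common_prefix_len(query, suffixes[j]))
    return best


def r_measure(text, others):
    # Assumed normalization: divide by l(l+1)/2, the largest possible sum.
    l = len(text)
    if l == 0:
        return 0.0
    indexes = [sorted_suffixes(doc) for doc in others]
    total = 0
    for i in range(l):
        suffix = text[i:]
        total += max((longest_match(suffix, idx) for idx in indexes), default=0)
    return 2.0 * total / (l * (l + 1))


def classify(text, training):
    # Assumed categorization rule: choose the class whose training documents
    # repeat the most of `text`. `training` maps a label to a list of strings.
    return max(training, key=lambda label: r_measure(text, training[label]))
```

For example, `r_measure("abc", ["abc"])` gives 1.0 for an exact duplicate and `r_measure("xyz", ["abc"])` gives 0.0 for a text with nothing in common, so values close to 1 flag likely duplicates. A realistic implementation would build one suffix array over the whole collection instead of a sorted suffix list per document.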