In this work we investigate the use of various character-level, lexical, and syntactic features, and their combinations, in automatic authorship attribution. Since the majority of text representation features are language specific, we examine their application to texts written in Croatian. Our work differs from similar work in at least three respects. First, we use a slightly different set of features than previously proposed. Second, we use four different data sets and compare the same features across those data sets to draw stronger conclusions. The data sets consist of articles, blogs, books, and forum posts written in Croatian. Finally, we employ a classification method based on a strong classifier. We use Support Vector Machines to learn classifiers that achieve excellent results on longer texts: 91% accuracy and F1 measure for blogs, 93% accuracy and F1 for articles, and 99% accuracy and F1 for books. Experiments conducted on forum posts show that more complex features need to be employed for shorter texts.
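To make the idea of combining feature levels concrete, the following is a minimal sketch and not the authors' implementation. It assumes character trigram frequencies as the character-level features and relative frequencies of a small function-word list as the lexical features; the abstract specifies neither choice. The word list `FUNCTION_WORDS` and all function names are hypothetical, and syntactic features are omitted because they would require a Croatian tagger or parser.

```python
from collections import Counter
import re

# Hypothetical, tiny function-word list for illustration only; the source
# does not list the words or features it actually uses.
FUNCTION_WORDS = ["i", "u", "je", "se", "na", "da", "za"]


def char_ngram_freqs(text, n=3):
    """Relative frequencies of character n-grams in the lowercased text."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def function_word_freqs(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word among all word tokens."""
    # In Python 3, \w matches Unicode letters, so Croatian diacritics are kept.
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


def combined_features(text, n=3):
    """Merge both feature levels into one dict with prefixed feature names."""
    features = {f"char:{g}": v for g, v in char_ngram_freqs(text, n).items()}
    features.update(
        {f"word:{w}": v for w, v in function_word_freqs(text).items()}
    )
    return features


def to_vectors(texts, n=3):
    """Turn texts into equal-length numeric vectors over a shared feature set."""
    dicts = [combined_features(t, n) for t in texts]
    names = sorted(set().union(*dicts)) if dicts else []
    vectors = [[d.get(name, 0.0) for name in names] for d in dicts]
    return names, vectors
```

Vectors produced by `to_vectors` could then be passed, together with author labels, to any SVM implementation. The prefixes `char:` and `word:` keep the two feature levels distinguishable, so a single level can be dropped to compare feature combinations across data sets.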