A Web-Based Self-training Approach for Authorship Attribution

  • Authors:
  • Rafael Guzmán-Cabrera;Manuel Montes-Y-Gómez;Paolo Rosso;Luis Villaseñor-Pineda

  • Affiliations:
  • FIMEE, Universidad de Guanajuato, México and NLE Lab, DSIC, Universidad Politécnica de Valencia, Spain;LabTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, México;NLE Lab, DSIC, Universidad Politécnica de Valencia, Spain;LabTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, México

  • Venue:
  • GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing
  • Year:
  • 2008

Abstract

Like any other text categorization task, authorship attribution requires a large number of training examples. Such examples are easy to obtain for most tasks but particularly difficult to obtain for this one. For this reason, in this paper we investigate the possibility of using Web-based text mining methods to identify the author of a given poem. In particular, we propose a semi-supervised method that is especially suited to working with just a few training examples, in order to tackle the lack of data with the same writing style. The method automatically extracts unlabeled examples from the Web and iteratively integrates them into the training data set. To the authors' knowledge, a semi-supervised method that uses the Web as a supporting lexical resource has not previously been employed in this task. The results obtained on poem categorization show that this method may improve classification accuracy and that it is appropriate for the attribution of short documents.
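
The iterative integration of unlabeled examples described in the abstract can be illustrated with a minimal self-training sketch. This is not the authors' implementation: the abstract does not name the classifier, the selection criterion, or the Web retrieval procedure, so the small Naive Bayes classifier, the `threshold` and `max_iters` parameters, and the in-memory list of snippets standing in for Web results are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokens as features.
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict_proba(self, doc):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)
            for word in tokenize(doc):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log(
                        (self.word_counts[c][word] + 1)
                        / (total_words + len(self.vocab))
                    )
            log_scores[c] = score
        # Convert log scores to normalized probabilities.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_iters=5):
    """Grow the training set with confidently classified unlabeled snippets.

    `labeled` is a list of (text, author) pairs; `unlabeled` is a list of
    texts, here a hypothetical stand-in for snippets retrieved from the Web.
    """
    docs = [text for text, _ in labeled]
    labels = [author for _, author in labeled]
    pool = list(unlabeled)
    model = NaiveBayes().fit(docs, labels)
    for _ in range(max_iters):
        confident, remaining = [], []
        for snippet in pool:
            probs = model.predict_proba(snippet)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                confident.append((snippet, best))
            else:
                remaining.append(snippet)
        if not confident:
            break  # nothing new to add, so stop early
        for snippet, author in confident:
            docs.append(snippet)
            labels.append(author)
        pool = remaining
        model = NaiveBayes().fit(docs, labels)  # retrain on the enlarged set
    return model, list(zip(docs, labels))
```

Calling `self_train` with a handful of labeled poems per author and a list of candidate snippets returns the retrained model together with the enlarged training set. Keeping only predictions above a confidence threshold is one common way to limit the noise that automatically labeled examples introduce; other selection rules are equally plausible here.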