A weighted profile intersection measure for profile-based authorship attribution

  • Authors:
  • Hugo Jair Escalante; Manuel Montes-y-Gómez; Thamar Solorio

  • Affiliations:
  • Universidad Autónoma de Nuevo León, San Nicolas de los Garza, N. L., Mexico; Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Puebla, Mexico; University of Alabama at Birmingham, Birmingham, AL

  • Venue:
  • MICAI'11 Proceedings of the 10th Mexican international conference on Advances in Artificial Intelligence - Volume Part I
  • Year:
  • 2011

Abstract

This paper introduces a new similarity measure, weighted profile intersection (WPI), for profile-based authorship attribution (PBAA). Authorship attribution (AA) is the task of determining which of a set of candidate authors wrote a given document. Under PBAA, an author's profile is built by combining information extracted from sample documents written by that author, and an unseen document is attributed to the author whose profile is most similar to it. Although PBAA has achieved competitive performance, it is limited in that the most widely used similarity measure accounts only for the number of terms that a test document and an author's profile have in common. WPI additionally takes into account an inter-author term penalization factor. Intuitively, when computing the similarity between an author's profile and a test document, WPI relies more on terms that are (frequently) used by the author of interest and not (frequently) used by other authors. We evaluate the proposed method on several AA data sets, including many data subsets from Twitter. Experimental results show that the proposed technique outperforms the standard PBAA method on all of the considered data sets, even though the baseline itself proved very effective. Further, the proposed method achieves performance comparable to that of classifier-based AA methods (e.g., methods based on SVMs), which often obtain better classification results at the expense of limited interpretability and a higher computational cost.
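
The abstract does not give the WPI formula, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes that a profile is the set of the most frequent character n-grams in an author's sample documents, and that the penalization factor is the reciprocal of the number of author profiles containing a term. Both choices, and all function names, are hypothetical and may differ from the definitions used in the paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(docs, n=3, size=500):
    """Assumed profile: the set of the `size` most frequent n-grams in docs."""
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(doc, n))
    return {term for term, _ in counts.most_common(size)}


def intersection_score(doc_profile, author_profile):
    """Unweighted baseline: number of terms shared by the two profiles."""
    return len(doc_profile & author_profile)


def weighted_intersection_score(doc_profile, author, profiles):
    """Illustrative weighted variant (an assumption, not the paper's formula).

    Each shared term contributes 1 / (number of author profiles containing
    it), so terms used by many authors count less than distinctive ones.
    """
    score = 0.0
    for term in doc_profile & profiles[author]:
        n_authors = sum(1 for p in profiles.values() if term in p)
        score += 1.0 / n_authors
    return score


def attribute(doc, profiles, n=3, size=500, weighted=True):
    """Return the author whose profile is most similar to the document."""
    doc_profile = build_profile([doc], n, size)
    if weighted:
        scores = {a: weighted_intersection_score(doc_profile, a, profiles)
                  for a in profiles}
    else:
        scores = {a: intersection_score(doc_profile, profiles[a])
                  for a in profiles}
    return max(scores, key=scores.get)
```

In this sketch, a term that appears in every author's profile adds little to any author's score, whereas a term found in only one profile adds a full point. This mirrors the intuition described above without claiming to reproduce the paper's exact weighting.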