Stopword Graphs and Authorship Attribution in Text Corpora

  • Authors:
  • R. Arun;V. Suresh;C. E. Veni Madhavan

  • Affiliations:
  • -;-;-

  • Venue:
  • ICSC '09 Proceedings of the 2009 IEEE International Conference on Semantic Computing
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this work we identify interactions of stopwords -noisewords- in text corpora as a fundamental feature to effect author classification. It is convenient to view such interactions as graphs wherein nodes are stopwords and the interaction between a pair of stopwords are represented as edge-weights. We define the interaction in terms of the distances between pairs of stopwords in text documents. Given a list of authors, graphs for each author is computed based on their undisputed writings. Authorship of a test document is attributed based on the closeness of the graph derived from it to the above graphs. Towards this, we define a closeness measure to compare such graphs based on the Kullback-Leibler divergence. We illustrate the accuracy of our approach by applying it on examples drawn from the Gutenberg archives. Our results show that the proposed approach is effective not only in binary author classification but also performs multiclass author classification for as many as 10 authors at a time and compares favourably with the state-of-the-art in author identification.