Machine Learning for Author Affiliation within Web Forums -- Using Statistical Techniques on NLP Features for Online Group Identification

  • Authors:
  • Jeffrey Ellen;Shibin Parameswaran

  • Venue:
  • ICMLA '11 Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops - Volume 01
  • Year:
  • 2011

Abstract

Although previous studies have attributed authorship to specific individuals, we find a shortage of efforts to group authors by their affiliations. This paper presents our work on classifying website forum posts by the author's group affiliation. Specifically, we seek to classify translated website forum posts by the (inferred) political affiliation of the author. The two datasets that we attempt to classify consist of real-world data discussing current issues: Israeli/Palestinian dialogue (the Bitter Lemons corpus) and translated Extremist/Moderate forum entries (from internet websites). To achieve our goal of reliable author affiliation, we extract term frequency-based features (which are conventional in document classification) along with less commonly used linguistic style-based features. The resulting set of stylometric features is then used with two widely used supervised classification algorithms, namely the k-Nearest Neighbor (k-NN) algorithm and Support Vector Machines (SVMs). Specifically, we use k-NN with cosine distance and SVMs with two different kernel functions. In addition to the popular RBF kernel, we also evaluate the applicability and performance of the recently introduced arc-cosine kernels for group affiliation. The results of our experiments show strong performance across a range of pertinent metrics.
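
The similarity measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy term-frequency vectors, and the labels "group_a"/"group_b" are hypothetical, and the arc-cosine kernel is written in its commonly cited degree-0 and degree-1 forms, k_n(x, y) = (1/pi) |x|^n |y|^n J_n(theta) with J_0(theta) = pi - theta and J_1(theta) = sin(theta) + (pi - theta) cos(theta), on the assumption that this is the kernel family the abstract refers to. All vectors are assumed to be nonzero.

```python
import math
from collections import Counter


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def angle(x, y):
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    c = dot(x, y) / (norm(x) * norm(y))
    return math.acos(max(-1.0, min(1.0, c)))


def cosine_distance(x, y):
    # 0 for vectors pointing the same way, 1 for orthogonal vectors.
    return 1.0 - dot(x, y) / (norm(x) * norm(y))


def rbf_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2); gamma is a free parameter (assumed value).
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def arc_cosine_kernel(x, y, degree=1):
    # Degree-0 and degree-1 arc-cosine kernels only.
    theta = angle(x, y)
    if degree == 0:
        return 1.0 - theta / math.pi
    if degree == 1:
        j1 = math.sin(theta) + (math.pi - theta) * math.cos(theta)
        return norm(x) * norm(y) * j1 / math.pi
    raise ValueError("only degrees 0 and 1 are sketched here")


def knn_predict(train, query, k=3):
    # Majority vote among the k training vectors closest in cosine distance.
    nearest = sorted(train, key=lambda item: cosine_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy term-frequency vectors with made-up group labels.
TRAIN = [
    ([3.0, 0.0, 1.0], "group_a"),
    ([2.0, 1.0, 0.0], "group_a"),
    ([0.0, 3.0, 1.0], "group_b"),
    ([1.0, 4.0, 0.0], "group_b"),
]
```

Calling knn_predict(TRAIN, [4.0, 1.0, 0.0], k=3) votes among the three nearest toy vectors. Either kernel function could be evaluated over all pairs of posts to build a precomputed Gram matrix for an SVM library that accepts one.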