Content-based similarity measures of weblog authors

  • Authors:
  • Christopher Wienberg;Melissa Roemmele;Andrew S. Gordon

  • Affiliations:
  • University of Southern California, Los Angeles, CA;University of Southern California, Los Angeles, CA;University of Southern California, Los Angeles, CA

  • Venue:
  • Proceedings of the 5th Annual ACM Web Science Conference
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

With recent research interest in the confounding roles of homophily and contagion in studies of social influence, there is a strong need for reliable content-based measures of the similarity between people. In this paper, we investigate the use of text similarity measures as a way of predicting the similarity of prolific weblog authors. We describe a novel method of collecting human judgments of overall similarity between two authors, as well as demographic, political, cultural, religious, values, hobbies/interests, personality, and writing style similarity. We then apply a range of automated textual similarity measures based on word frequency counts, and calculate their statistical correlation with human judgments. Our findings indicate that commonly used text similarity measures do not correlate well with human judgments of author similarity. However, various measures that pay special attention to personal pronouns and their context correlate significantly with different facets of similarity.