Twitter n-gram corpus with demographic metadata

  • Authors: Amaç Herdağdelen
  • Affiliations: Facebook Inc., Menlo Park, USA 94025 and CIMeC, University of Trento, Rovereto, Italy
  • Venue: Language Resources and Evaluation
  • Year: 2013

Abstract

Social media is a natural laboratory for linguistic and sociological research. On micro-blogging platforms such as Twitter, people share hundreds of millions of short messages about their lives and experiences every day. These messages, coupled with metadata about their authors, provide an opportunity to understand a wide variety of phenomena, ranging from political polarization to geographic and demographic lexical variation. The lack of publicly available micro-blogging datasets has been a hindrance to replicable research. In this paper, I introduce the Rovereto Twitter n-gram corpus, a publicly available n-gram dataset of Twitter messages in which each n-gram is tagged with the gender of the author and the time of posting. I compare this dataset to a more traditional web-based corpus and present a case study that shows the potential of combining an n-gram corpus with demographic metadata.