Twitter n-gram corpus with demographic metadata

Authors: Amaç Herdağdelen
Affiliations: Facebook Inc., Menlo Park, USA 94025 and CIMeC, University of Trento, Rovereto, Italy
Venue: Language Resources and Evaluation
Year: 2013

Abstract

Social media is a natural laboratory for linguistic and sociological research. On micro-blogging platforms such as Twitter, people share hundreds of millions of short messages about their lives and experiences every day. These messages, coupled with metadata about their authors, provide an opportunity to study a wide variety of phenomena, ranging from political polarization to geographic and demographic lexical variation. The lack of publicly available micro-blogging datasets has been a hindrance to replicable research. In this paper, I introduce the Rovereto Twitter n-gram corpus, a publicly available n-gram dataset of Twitter messages in which each n-gram is tagged with the gender of the author and the time of posting. I compare this dataset to a more traditional web-based corpus and present a case study that shows the potential of combining an n-gram corpus with demographic metadata.