Social media is a natural laboratory for linguistic and sociological research. On micro-blogging platforms such as Twitter, people share hundreds of millions of short messages about their lives and experiences every day. These messages, coupled with metadata about their authors, provide an opportunity to study a wide variety of phenomena, ranging from political polarization to geographic and demographic lexical variation. The lack of publicly available micro-blogging datasets has hindered replicable research. In this paper, I introduce the Rovereto Twitter n-gram corpus, a publicly available n-gram dataset of Twitter messages in which each n-gram is tagged with the gender of its author and the time of posting. I compare this dataset to a more traditional web-based corpus and present a case study that shows the potential of combining an n-gram corpus with demographic metadata.
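
To make the idea of an n-gram table tagged with author gender and posting time concrete, the following is a minimal sketch. It is not the corpus's actual build procedure or file format: the `(text, gender, hour)` record layout, the whitespace tokenizer, and the function names are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in a token list as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_tagged_ngrams(messages, n=2):
    """Count n-grams keyed by (ngram, gender, hour).

    `messages` is an iterable of (text, gender, hour) triples. This
    record layout is hypothetical, not the published corpus format.
    """
    counts = Counter()
    for text, gender, hour in messages:
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        for gram in ngrams(tokens, n):
            counts[(gram, gender, hour)] += 1
    return counts


# Toy input, invented for the example.
sample = [
    ("so happy today", "f", 9),
    ("so happy right now", "m", 9),
    ("so happy again", "f", 9),
]
counts = count_tagged_ngrams(sample, n=2)
```

Keying each count on the n-gram together with its tags allows the same table to be sliced by gender or by time of day without reprocessing the underlying messages, which is the kind of analysis the case study refers to.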