Feature-rich part-of-speech tagging with a cyclic dependency network
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Scalable training of L1-regularized log-linear models
Proceedings of the 24th international conference on Machine learning
Modeling latent biographic attributes in conversational genres
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Classifying latent user attributes in twitter
SMUC '10 Proceedings of the 2nd international workshop on Search and mining user-generated contents
Predicting age and gender in online social networks
Proceedings of the 3rd international workshop on Search and mining user-generated contents
Personal User or Organizational User? Behavior on Microblog can Tell
ASONAM '12 Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012)
Aggregating Personal Health Messages for Scalable Comparative Effectiveness Research
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
Hi-index | 0.00 |
While the study of the connection between discourse patterns and personal identification is decades old, the study of these patterns using language technologies is relatively recent. In that more recent tradition we frame author age prediction from text as a regression problem. We explore the same task using three very different genres of data simultaneously: blogs, telephone conversations, and online forum posts. We employ a technique from domain adaptation that allows us to train a joint model involving all three corpora together as well as separately and analyze differences in predictive features across joint and corpus-specific aspects of the model. Effective features include both stylistic ones (such as POS patterns) as well as content oriented ones. Using a linear regression model based on shallow text features, we obtain correlations up to 0.74 and mean absolute errors between 4.1 and 6.8 years.