Communication on online social networks typically takes place through short messages that often use non-standard language variations. These characteristics make this type of text a challenging genre for natural language processing. Moreover, in these digital communities it is easy to provide a false name, age, gender and location in order to hide one's true identity, which gives criminals such as pedophiles new possibilities to groom their victims. It would therefore be useful if user profiles could be checked on the basis of text analysis, and false profiles flagged for monitoring. This paper presents an exploratory study in which we apply a text categorization approach to the prediction of age and gender on a corpus of chat texts collected from the Belgian social networking site Netlog. We examine which types of features are most informative for reliable prediction of age and gender on this difficult text type, and we perform experiments with different data set sizes to gain more insight into the minimum data size requirements for this task.
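To make the idea of a text categorization approach concrete, the following is a minimal sketch of how short chat messages could be assigned to a class from surface features. The abstract does not say which features or learning algorithm the study uses, so everything here is an assumption made for illustration: character trigrams as features, a multinomial naive Bayes classifier with add-one smoothing, the names `char_ngrams` and `NaiveBayes`, and the four invented toy messages with their "younger"/"older" labels. None of it is taken from the Netlog corpus or from the paper's experimental setup.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded message."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams (illustrative assumption)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        # Vocabulary of all n-grams seen in training, used for add-one smoothing.
        self.vocab = {g for counts in self.feature_counts.values() for g in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(text):
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy messages, NOT data from the study.
train_texts = [
    "omg lol c u 2morrow",
    "haha thx xx lol",
    "I will call you after the meeting",
    "Please send the report tomorrow",
]
train_labels = ["younger", "younger", "older", "older"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("lol c u xx")
print(prediction)
```

Character n-grams are chosen in this sketch because they tolerate the abbreviations and non-standard spelling typical of chat text without requiring a tokenizer. Whether they are among the most informative feature types is exactly the kind of question the study sets out to examine. Training the same model on progressively larger subsets of a labeled corpus would be one way to probe minimum data size requirements.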