Text is still the most prevalent media type on the Internet. Examples include popular social networking applications such as Twitter, Craigslist and Facebook. Other web applications such as e-mail, blogs and chat rooms are also mostly text based. In this paper we address the following question in text-based Internet forensics: given a short text document, can we identify whether the author is a man or a woman? This question is motivated by recent events in which people faked their gender on the Internet. Note that this problem is different from authorship attribution. We investigate author gender identification for short, multi-genre, content-free text such as that found in many Internet applications. The fundamental questions we ask are: do men and women inherently use different classes of language styles? If so, what are good linguistic features that indicate gender? Based on research in human psychology, we propose 545 psycho-linguistic and gender-preferential cues, along with stylometric features, to build the feature space for this identification problem. Note that identifying the correct set of features that indicate gender is an open research problem. Three machine learning algorithms (support vector machine, Bayesian logistic regression and AdaBoost decision tree) are then designed for gender identification based on the proposed features. Extensive experiments on large text corpora (Reuters Corpus Volume 1 newsgroup data and Enron e-mail data) indicate an accuracy of up to 85.1% in identifying gender. The experiments also indicate that function words, word-based features and structural features are significant gender discriminators.
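To make the three feature families named above more concrete, the following is a minimal sketch of how function-word, word-based and structural features might be computed for one short document. It is not the paper's feature set: the ten-word `FUNCTION_WORDS` list, the `extract_features` helper and the specific measures (average word length, type-token ratio, sentence and paragraph counts) are illustrative assumptions, and the 545 psycho-linguistic cues are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; the paper's actual word lists are not given in this text).
FUNCTION_WORDS = ("a", "an", "and", "but", "i", "in", "of", "the", "to", "you")


def extract_features(text):
    """Return a dict of illustrative stylometric features for one document."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    counts = Counter(words)

    features = {}
    # Function-word features: relative frequency of each listed word.
    for word in FUNCTION_WORDS:
        features["fw_" + word] = counts[word] / n_words
    # Word-based features: average word length and vocabulary richness.
    features["avg_word_len"] = sum(len(w) for w in words) / n_words
    features["type_token_ratio"] = len(counts) / n_words
    # Structural features: counts of sentences and paragraphs.
    features["n_sentences"] = len(sentences)
    features["n_paragraphs"] = len(paragraphs)
    features["avg_sentence_len"] = len(words) / max(len(sentences), 1)
    return features


if __name__ == "__main__":
    sample = "I went to the store. You and I like the park!\n\nThe end."
    for name, value in sorted(extract_features(sample).items()):
        print(name, round(value, 3))
```

A vector of this kind, computed per document, is the sort of input that a classifier such as a support vector machine would be trained on; the choice of classifier and its settings are outside what this sketch covers.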