Author gender identification from text

Authors:
Na Cheng;R. Chandramouli;K. P. Subbalakshmi
Affiliations:
Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA;Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA;Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
Venue:
Digital Investigation: The International Journal of Digital Forensics & Incident Response
Year:
2011

Abstract

Text is still the most prevalent Internet media type. Examples include popular social networking applications such as Twitter, Craigslist, and Facebook. Other web applications such as e-mail, blogs, and chat rooms are also mostly text based. A question we address in this paper, which deals with text-based Internet forensics, is the following: given a short text document, can we identify whether the author is a man or a woman? This question is motivated by recent events in which people faked their gender on the Internet. Note that this problem is different from the authorship attribution problem. In this paper we investigate author gender identification for short, multi-genre, content-free text, such as that found in many Internet applications. The fundamental questions we ask are: do men and women inherently use different classes of language styles? If so, which linguistic features are good indicators of gender? Based on research in human psychology, we propose 545 psycho-linguistic and gender-preferential cues along with stylometric features to build the feature space for this identification problem. Note that identifying the correct set of features that indicate gender is an open research problem. Three machine learning algorithms (support vector machine, Bayesian logistic regression, and AdaBoost decision tree) are then designed for gender identification based on the proposed features. Extensive experiments on large text corpora (Reuters Corpus Volume 1 newsgroup data and Enron e-mail data) indicate an accuracy of up to 85.1% in identifying the gender. Experiments also indicate that function words, word-based features, and structural features are significant gender discriminators.
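
As an illustration of the three feature families the abstract names as significant discriminators (function words, word-based features, and structural features), the following is a minimal sketch in Python. It is not the authors' feature set: the `FUNCTION_WORDS` tuple is a small hypothetical sample, the feature names and the `extract_features` helper are assumptions made for this example, and the paper's 545 cues are not reproduced here. The resulting dictionary could, in principle, be turned into a vector and passed to any of the classifiers mentioned above.

```python
import re

# Hypothetical, very small sample of function words chosen for illustration;
# the paper's actual word list is not given in the abstract.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "i", "you", "not", "but")


def extract_features(text):
    """Return a dict of illustrative stylometric features for one document."""
    # Lowercased alphabetic tokens (apostrophes kept, so "don't" is one token).
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)

    # Sentences: split on runs of ., ! or ? and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Paragraphs: split on blank lines and drop empty fragments.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]

    features = {}

    # Function-word features: relative frequency of each listed word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words if n_words else 0.0

    # Word-based features: average word length and vocabulary richness.
    features["avg_word_len"] = (
        sum(len(w) for w in words) / n_words if n_words else 0.0
    )
    features["type_token_ratio"] = len(set(words)) / n_words if n_words else 0.0

    # Structural features: counts of sentences and paragraphs, sentence length.
    features["n_sentences"] = len(sentences)
    features["n_paragraphs"] = len(paragraphs)
    features["avg_sentence_len"] = n_words / len(sentences) if sentences else 0.0

    return features


if __name__ == "__main__":
    sample = "I think the plan is good. You do not agree!\n\nBut the vote is final."
    for name, value in extract_features(sample).items():
        print(name, value)
```

Relative frequencies are used instead of raw counts so that short documents of different lengths remain comparable, which matters for the short texts the paper targets.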