Improving gender classification of blog authors

Authors:
Arjun Mukherjee;Bing Liu
Affiliations:
University of Illinois at Chicago, Chicago, IL;University of Illinois at Chicago, Chicago, IL
Venue:
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Year:
2010

Abstract

Automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams for classification learning. In this paper, we propose two new techniques to improve on current results. The first introduces a new class of features: variable-length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second is a new feature selection method based on an ensemble of several feature selection criteria and approaches. Empirical evaluation on a real-life blog data set shows that these two techniques significantly improve classification accuracy over current state-of-the-art methods.
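
To make the first technique more concrete, the following is a minimal sketch of mining variable-length POS patterns with a minimum-support threshold. It is not the paper's algorithm, which the abstract does not spell out: it assumes contiguous patterns only, assumes that support is measured as the fraction of documents containing a pattern, and uses Apriori-style pruning. The function name `mine_pos_patterns`, its parameters, and the toy tag sequences are all hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Pattern = Tuple[str, ...]


def mine_pos_patterns(
    tagged_docs: Sequence[Sequence[str]],
    min_support: float,
    max_len: int,
) -> Dict[Pattern, float]:
    """Return contiguous POS patterns with document support >= min_support.

    Assumption: support = fraction of documents containing the pattern.
    """
    n_docs = len(tagged_docs)
    frequent: Dict[Pattern, float] = {}

    # Grow patterns one tag at a time, so pattern length is variable.
    for length in range(1, max_len + 1):
        counts: Counter = Counter()
        for tags in tagged_docs:
            seen = set()  # count each pattern at most once per document
            for i in range(len(tags) - length + 1):
                gram = tuple(tags[i:i + length])
                # Apriori pruning: a pattern can only be frequent if both of
                # its (length - 1) contiguous sub-patterns are frequent.
                if length > 1 and (
                    gram[:-1] not in frequent or gram[1:] not in frequent
                ):
                    continue
                seen.add(gram)
            counts.update(seen)

        level = {
            p: c / n_docs for p, c in counts.items() if c / n_docs >= min_support
        }
        if not level:
            break  # no longer pattern can be frequent
        frequent.update(level)

    return frequent


# Toy input: one POS tag sequence per blog post (hypothetical data).
docs = [
    ["PRP", "VBP", "JJ"],
    ["PRP", "VBP", "NN"],
    ["DT", "NN", "VBZ"],
]
patterns = mine_pos_patterns(docs, min_support=0.6, max_len=3)
```

Each mined pattern could then serve as a binary feature (present or absent in a post) alongside words and fixed-length POS n-grams. How such features are weighted and then filtered by the ensemble feature selection step is not described in the abstract and is left out of this sketch.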