Detecting offensive tweets via topical feature discovery over a large scale twitter corpus

Authors:
Guang Xiang;Bin Fan;Ling Wang;Jason Hong;Carolyn Rose
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content on Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, using Logistic Regression, it achieves a true positive rate (TP) of 75.1% over 4029 testing tweets, significantly outperforming the popular keyword matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline, about 3.77%. Our approach provides an alternative to the large scale hand annotation efforts required by fully supervised learning approaches.
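
To make the evaluation terms in the abstract concrete, the following is a minimal sketch of a keyword matching baseline and of how a true positive rate and a false positive rate could be computed from its predictions. It is not the authors' implementation: the lexicon `OFFENSIVE_KEYWORDS`, the tokenizer, the helper names `keyword_baseline` and `tp_fp_rates`, and the four toy tweets are all assumptions made for illustration, and the resulting numbers have no relation to the figures reported in the paper.

```python
import re

# Hypothetical placeholder lexicon (assumption); the paper's keyword list is not given here.
OFFENSIVE_KEYWORDS = {"darn", "heck"}


def keyword_baseline(tweet, keywords=OFFENSIVE_KEYWORDS):
    """Flag a tweet as offensive if any lowercase token is in the keyword set."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return any(tok in keywords for tok in tokens)


def tp_fp_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy data (assumption), chosen to show one miss and one false alarm.
tweets = [
    "what the heck is this",      # labeled offensive, keyword present
    "you are a d4rn fool",        # labeled offensive, obfuscated spelling is missed
    "lovely weather today",       # labeled clean
    "heck of a game last night",  # labeled clean, but the keyword still fires
]
labels = [True, True, False, False]

preds = [keyword_baseline(t) for t in tweets]
tpr, fpr = tp_fp_rates(labels, preds)
print(f"TP rate = {tpr:.2f}, FP rate = {fpr:.2f}")
```

The obfuscated spelling in the second toy tweet illustrates why pure keyword matching can miss offensive content, and the fourth shows how it can flag a harmless use of a listed word. These are the kinds of errors a learned classifier over richer features is meant to reduce.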