Automatic Moderation of Online Discussion Sites

  • Authors: Jean-Yves Delort; Bavani Arunasalam; Cecile Paris

  • Affiliations: Macquarie University; University of Sydney; CSIRO-Information and Communication Technology (ICT) Centre, Information Engineering Laboratory, Sydney, Australia

  • Venue: International Journal of Electronic Commerce

  • Year: 2011


Abstract

Online discussion sites are plagued by various types of unwanted content, such as spam and obscene or malicious messages. Both prevention- and detection-based techniques have been proposed to filter inappropriate content from online discussion sites. However, although prevention techniques have been widely adopted, detecting inappropriate content remains mostly a manual task. Existing detection techniques, which fall into rule-based and statistical approaches, suffer from various limitations. Rule-based techniques usually consist of manually crafted rules or blacklists of keywords; both are time-consuming to create and tend to generate many false positives and false negatives. Statistical techniques typically use corpora of labeled examples to train a classifier to tell "good" and "bad" messages apart. Although statistical techniques are generally more robust than rule-based techniques, they are difficult to deploy because of the prohibitive cost of manually labeling examples. In this paper we describe a novel classification technique that trains a classifier from a partially labeled corpus and uses it to moderate inappropriate content on online discussion sites. Partially labeled corpora are much easier to produce than fully labeled corpora, as they consist only of unlabeled examples and examples labeled with a single class (e.g., "bad"). We implemented and tested this technique on a corpus of messages posted on a stock message board and compared it with two baseline techniques. Results show that our method outperforms both baselines and can significantly reduce the number of messages that human moderators need to review.
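The partially labeled setting the abstract describes, where only "bad" examples carry labels and the rest are unlabeled, is commonly known as positive-unlabeled (PU) learning. The paper's own classifier is not reproduced here; as an illustration only, the sketch below shows a standard PU baseline: a Naive Bayes model trained by treating the unlabeled messages as provisional negatives, then used to rank messages by their estimated probability of being "bad" so that human moderators review only the highest-scoring ones. The function name and the toy corpus are invented for this example.

```python
import math
from collections import Counter

def train_pu_scorer(bad_msgs, unlabeled_msgs, alpha=1.0):
    """Train a word-level Naive Bayes scorer using the PU heuristic:
    labeled 'bad' messages are positives, unlabeled messages are
    treated as provisional negatives. Returns a function mapping a
    message to an estimated probability of being 'bad'."""
    bad_counts = Counter(w for m in bad_msgs for w in m.lower().split())
    unl_counts = Counter(w for m in unlabeled_msgs for w in m.lower().split())
    vocab_size = len(set(bad_counts) | set(unl_counts))
    bad_total = sum(bad_counts.values())
    unl_total = sum(unl_counts.values())
    prior_bad = len(bad_msgs) / (len(bad_msgs) + len(unlabeled_msgs))

    def score(msg):
        # Log-probabilities under each class, with Laplace smoothing (alpha).
        log_bad = math.log(prior_bad)
        log_unl = math.log(1.0 - prior_bad)
        for w in msg.lower().split():
            log_bad += math.log((bad_counts[w] + alpha) / (bad_total + alpha * vocab_size))
            log_unl += math.log((unl_counts[w] + alpha) / (unl_total + alpha * vocab_size))
        # Convert the log-odds into P(bad | msg).
        return 1.0 / (1.0 + math.exp(log_unl - log_bad))

    return score

if __name__ == "__main__":
    # Invented toy corpus: a few labeled 'bad' posts and some unlabeled ones.
    bad = ["buy cheap stock now", "cheap pills buy now"]
    unlabeled = ["earnings report looks solid", "quarterly results discussion"]
    score = train_pu_scorer(bad, unlabeled)
    # Rank unlabeled traffic; only messages above a threshold go to moderators.
    for msg in ["buy cheap stock", "quarterly earnings report"]:
        print(msg, round(score(msg), 3))
```

In this heuristic the classifier is biased, since some "provisional negatives" are actually bad, but ranking by score still concentrates likely-bad messages at the top of the review queue, which is the moderation-workload reduction the paper targets.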