Online discussion sites are plagued with various types of unwanted content, such as spam and obscene or malicious messages. Both prevention-based and detection-based techniques have been proposed to filter inappropriate content from these sites. Although prevention techniques have been widely adopted, detection of inappropriate content remains mostly a manual task. Existing detection techniques, which fall into rule-based and statistical categories, suffer from various limitations. Rule-based techniques usually consist of manually crafted rules or keyword blacklists. Both are time-consuming to create and tend to generate many false positives and false negatives. Statistical techniques typically use corpora of labeled examples to train a classifier to tell "good" and "bad" messages apart. Although statistical techniques are generally more robust than rule-based techniques, they are difficult to deploy because of the prohibitive cost of manually labeling examples. In this paper we describe a novel classification technique that trains a classifier from a partially labeled corpus and uses it to moderate inappropriate content on online discussion sites. Partially labeled corpora are much easier to produce than completely labeled corpora, as they consist only of unlabeled examples and examples labeled with a single class (e.g., "bad"). We implemented and tested this technique on a corpus of messages posted on a stock message board and compared it with two baseline techniques. Results show that our method outperforms the two baselines and that it can significantly reduce the number of messages that need to be reviewed by human moderators.
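To make the idea of a partially labeled corpus concrete, the following is a minimal sketch of one simple way to score messages when only "bad" examples and unlabeled examples are available. It is not the technique described in the paper: it assumes a generic word-level log-odds scorer with add-one smoothing in which unlabeled messages are provisionally treated as "good". The function names (`tokenize`, `train_counts`, `log_odds_scorer`) and the toy messages are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real systems would normalize more carefully.
    return text.lower().split()


def train_counts(docs):
    # Count how often each token occurs across a list of messages.
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def log_odds_scorer(bad_docs, unlabeled_docs):
    """Build a scoring function from a partially labeled corpus.

    Assumption: unlabeled messages stand in for the "good" class.
    This is a common simple baseline, not the paper's method.
    """
    bad = train_counts(bad_docs)
    unl = train_counts(unlabeled_docs)
    vocab = set(bad) | set(unl)
    v = len(vocab)
    bad_total = sum(bad.values())
    unl_total = sum(unl.values())

    def score(text):
        # Sum of per-token log-odds with add-one (Laplace) smoothing.
        total = 0.0
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore tokens never seen during training
            p_bad = (bad[tok] + 1) / (bad_total + v)
            p_unl = (unl[tok] + 1) / (unl_total + v)
            total += math.log(p_bad / p_unl)
        return total

    return score


# Toy corpus: a few messages labeled "bad" plus unlabeled messages.
bad_docs = ["buy cheap stock now", "cheap pills buy now"]
unlabeled_docs = [
    "earnings report looks strong this quarter",
    "buy cheap pills now",
    "what do you think about the dividend",
]

score = log_odds_scorer(bad_docs, unlabeled_docs)

# Rank unlabeled messages so moderators can review the most suspicious first.
ranked = sorted(unlabeled_docs, key=score, reverse=True)
for message in ranked:
    print(f"{score(message):+.2f}  {message}")
```

Ranking rather than hard classification reflects the moderation setting described above: a human reviewer can work down the list from the highest score and stop once the messages look acceptable, which is one way such a scorer could reduce the number of messages that need manual review.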