Filtering harmful sentences based on three-word co-occurrence

Authors:
Yutaro Fujii;Takuya Yoshimura;Takayuki Ito
Affiliations:
Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya;Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya;Institute, University of Tokyo, Gokiso-cho, Showa-ku, Nagoya
Venue:
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Year:
2011

Abstract

Bulletin board systems (BBS) and social network services (SNS) have become popular collaboration tools in recent years. In such systems, users can easily upload and share their own information via personal computers and mobile phones. However, some information, such as adult content, is not appropriate for all users, notably children. Many SNS and BBS providers try to monitor and remove harmful information posted by their users. At present, these companies manually check users' sentences before publishing them, so even partial automation of this monitoring would substantially reduce labor costs. Motivated by this, we focus on filtering harmful text information. We have built a system that uses word co-occurrences for text filtering, because co-occurrences are useful characteristics of sentences: they may reflect the context in which the words appear. This paper presents methods that use two-word and three-word co-occurrences. Our preliminary experiments demonstrated that the method using three-word co-occurrences filters harmful information more effectively than the method using two-word co-occurrences. The experiments also show that our method works better than Bayesian filtering, which is commonly used for spam-mail filtering. In addition, we present a method that combines Bayesian filtering with our proposed methods. Finally, we identify two problems with these filtering methods and confirm one of the problems with Bayesian filtering.
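
To make the idea of two-word and three-word co-occurrences concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, treats a co-occurrence as an unordered combination of distinct words within one sentence, and uses a simple majority-vote score; the function names `cooccurrences`, `train`, and `harmful_score` and the scoring rule itself are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentence, n):
    """Return the set of unordered n-word combinations found in one sentence.

    Assumption: words are separated by whitespace; a real system would
    likely need a language-specific tokenizer.
    """
    words = sorted(set(sentence.lower().split()))
    return set(combinations(words, n))


def train(labeled_sentences, n):
    """Count each n-word combination in harmful and harmless sentences."""
    harmful, harmless = Counter(), Counter()
    for sentence, is_harmful in labeled_sentences:
        target = harmful if is_harmful else harmless
        target.update(cooccurrences(sentence, n))
    return harmful, harmless


def harmful_score(sentence, harmful, harmless, n):
    """Fraction of known combinations seen more often in harmful text.

    Hypothetical scoring rule: combinations never seen in training are
    ignored, and a sentence with no known combinations scores 0.0.
    """
    known = [c for c in cooccurrences(sentence, n)
             if harmful[c] + harmless[c] > 0]
    if not known:
        return 0.0
    hits = sum(1 for c in known if harmful[c] > harmless[c])
    return hits / len(known)


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    data = [
        ("send me your photo now", True),
        ("send me the report now", False),
    ]
    harmful, harmless = train(data, n=3)
    print(harmful_score("your photo now", harmful, harmless, n=3))
    print(harmful_score("the report now", harmful, harmless, n=3))
```

Calling the same functions with `n=2` gives the two-word variant. Three-word combinations are more specific than pairs, which is one plausible reading of why they might capture context better, at the cost of many more combinations per sentence.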