Filtering harmful sentences based on three-word co-occurrence

  • Authors:
  • Yutaro Fujii; Takuya Yoshimura; Takayuki Ito

  • Affiliations:
  • Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya; Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya; Institute, University of Tokyo, Gokiso-cho, Showa-ku, Nagoya

  • Venue:
  • Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
  • Year:
  • 2011

Abstract

Bulletin board systems (BBS) and social network services (SNS) have become popular collaboration tools in recent years. In such systems, users can easily upload and share their own information from personal computers and mobile phones. However, some information, such as adult content, is not appropriate for all users, notably children. Many SNS and BBS providers therefore try to monitor and remove harmful information posted by their users. At present, these companies manually check users' sentences before publishing them, so even partial automation of this monitoring task would reduce the large labor cost. With this motivation, we focus on filtering harmful text information. We have built a system that uses word co-occurrences for text filtering, because co-occurrences are useful characteristics of a sentence: they may reflect its context. This paper presents methods that use two-word and three-word co-occurrences. Our preliminary experiments show that the method using three-word co-occurrences filters harmful information more effectively than the method using two-word co-occurrences. The experiments also show that our method works better than the Bayesian filtering method typically used for spam-mail filtering. In addition, we present a method that combines Bayesian filtering with our proposed methods. Finally, we describe two problems with these filtering methods and confirm one of the problems with Bayesian filtering.
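
The abstract does not spell out how co-occurrences are counted or scored, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a co-occurrence is an unordered set of n distinct words that appear in the same sentence, and it assumes a simple score: the fraction of a sentence's known combinations that were seen more often in harmful training sentences than in harmless ones. The function names, the toy training data, and the scoring rule are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(tokens, n):
    """Return the unordered n-word combinations present in one sentence."""
    return set(combinations(sorted(set(tokens)), n))


def train(sentences, n):
    """Count each n-word combination in harmful and harmless sentences."""
    harmful, harmless = Counter(), Counter()
    for tokens, is_harmful in sentences:
        target = harmful if is_harmful else harmless
        target.update(cooccurrences(tokens, n))
    return harmful, harmless


def harmful_score(tokens, harmful, harmless, n):
    """Fraction of known combinations seen more often in harmful text.

    This scoring rule is an assumption made for illustration only.
    """
    known = [c for c in cooccurrences(tokens, n)
             if c in harmful or c in harmless]
    if not known:
        return 0.0  # nothing seen in training, so no evidence either way
    hits = sum(1 for c in known if harmful[c] > harmless[c])
    return hits / len(known)


# Toy, hand-labelled training data (hypothetical).
TRAINING = [
    ("meet me tonight alone".split(), True),
    ("send me your photo tonight".split(), True),
    ("meet me at the library".split(), False),
    ("send me the homework tonight".split(), False),
]

pair_model = train(TRAINING, n=2)
triple_model = train(TRAINING, n=3)

query = "meet me alone tonight".split()
pair_score = harmful_score(query, *pair_model, n=2)
triple_score = harmful_score(query, *triple_model, n=3)
```

In this toy data the pair ("me", "meet") occurs equally often in harmful and harmless sentences, so it contributes no evidence to the two-word score, whereas every three-word combination in the query occurs only in harmful sentences. This is one plausible reading of why longer co-occurrences might capture context better; the number of combinations also grows quickly with n, so a practical system would likely need pruning.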