Filtering Harmful Sentences Based on Multiple Word Co-occurrence
ICIS '10 Proceedings of the 2010 IEEE/ACIS 9th International Conference on Computer and Information Science
Bulletin board systems (BBS) and social network services (SNS) have become popular collaboration tools in recent years. In such systems, users can easily upload and share their own information via personal computers and mobile phones. However, some information, such as adult content, is not appropriate for all users, notably children. Many SNS and BBS providers therefore try to monitor and remove harmful information posted by their users. Currently, these companies manually check users' sentences before publishing them, so even partial automation of this task would reduce the substantial labor cost. With this motivation, we focus on filtering harmful text. We have built a system that uses word co-occurrences for text filtering, because co-occurring words are a useful characteristic of a sentence: they may reflect the context in which the words appear. This paper presents methods that use two-word and three-word co-occurrences. Our preliminary experiments show that the method using three-word co-occurrences filters harmful information more effectively than the method using two-word co-occurrences. The experiments also show that our method performs better than Bayesian filtering, which is typically used for spam-mail filtering. In addition, we present a method that combines Bayesian filtering with our proposed methods. Finally, we describe two problems with these filtering methods and confirm one of them for Bayesian filtering.
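To make the idea of filtering by word co-occurrence concrete, the following is a minimal Python sketch. It is not the system described in the paper: the abstract does not specify how co-occurrences are extracted or scored, so the whitespace tokenization, the function names (`cooccurrences`, `train`, `harmful_score`), and the scoring rule (the fraction of a sentence's word combinations seen more often in harmful than in harmless training sentences) are all assumptions made for illustration. Setting `n=2` or `n=3` corresponds to two-word and three-word co-occurrences.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentence, n):
    """Return the set of unordered n-word combinations in one sentence.

    Assumption: words are split on whitespace and lowercased; the paper's
    actual tokenization is not given in the abstract.
    """
    words = sorted(set(sentence.lower().split()))
    return set(combinations(words, n))


def train(harmful, harmless, n):
    """Count, per class, how many sentences contain each co-occurrence."""
    harmful_counts = Counter()
    harmless_counts = Counter()
    for sentence in harmful:
        harmful_counts.update(cooccurrences(sentence, n))
    for sentence in harmless:
        harmless_counts.update(cooccurrences(sentence, n))
    return harmful_counts, harmless_counts


def harmful_score(sentence, model, n):
    """Hypothetical score: share of the sentence's co-occurrences that were
    seen in more harmful than harmless training sentences (0.0 to 1.0)."""
    harmful_counts, harmless_counts = model
    combos = cooccurrences(sentence, n)
    if not combos:
        # Fewer than n distinct words: nothing to judge.
        return 0.0
    hits = sum(1 for c in combos if harmful_counts[c] > harmless_counts[c])
    return hits / len(combos)


if __name__ == "__main__":
    # Toy training data, invented for this example only.
    harmful = ["buy cheap pills now", "cheap pills online now"]
    harmless = ["buy fresh bread now", "fresh bread online"]
    for n in (2, 3):
        model = train(harmful, harmless, n)
        print(n, harmful_score("cheap pills now", model, n))
```

A sentence would then be flagged when its score exceeds some threshold, which would have to be tuned on labeled data. Larger `n` makes each combination more specific to a context, at the cost of sparser counts.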