A Smart Filtering System for Newly Coined Profanities by Using Approximate String Alignment

  • Authors:
  • Taijin Yoon;Sun-Young Park;Hwan-Gue Cho

  • Venue:
  • CIT '10 Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology
  • Year:
  • 2010

Abstract

Verbal abuse is becoming a serious social problem in online communication, because anonymity makes it easier to use profanities. Detecting and removing words that are registered in a forbidden-word list is a straightforward filtering method. It is simple, but maintaining the list is difficult because newly coined words must continually be added to the lexicon. Korean in particular is an agglutinative language, so new variations of a vulgar word are easy to construct without hindering textual communication in an online environment. In this paper we propose a new method that detects variations of a vulgar word produced by phoneme modification, by applying a phoneme-based string alignment. However, aligning a query word against every vulgar word registered in a database is computationally expensive. We propose an R*-tree based search algorithm to reduce this cost. The method exploits the metric space property of string edit distance. For the experiments we prepared a word database with more than 9300 prototype vulgar words. For a given query word, our algorithm quickly finds the best-aligned candidate words (0.006 sec. with 1000 words) that are within an edit distance of one unit. Our contribution is that we empirically determined the number of pivot words needed to create a near-optimal search space.
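The idea of combining phoneme-level edit distance with pivot words can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the alignment scoring or the R*-tree layout, so this sketch assumes unit-cost Levenshtein distance over Hangul jamo indices and swaps the R*-tree for a plain table of precomputed pivot distances, pruned with the triangle inequality. The names `to_phonemes`, `edit_distance`, and `PivotIndex` are hypothetical, and the ASCII word list in the usage note is only a stand-in.

```python
from typing import List, Sequence, Tuple

HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllable block


def to_phonemes(word: str) -> Tuple:
    """Split each precomposed Hangul syllable into lead/vowel/tail indices.

    Characters outside the syllable block pass through unchanged.
    """
    out = []
    for ch in word:
        cp = ord(ch)
        if HANGUL_BASE <= cp <= HANGUL_LAST:
            idx = cp - HANGUL_BASE
            out.append(("L", idx // 588))         # leading consonant
            out.append(("V", (idx % 588) // 28))  # vowel
            tail = idx % 28
            if tail:                              # 0 means no final consonant
                out.append(("T", tail))
        else:
            out.append(("C", ch))
    return tuple(out)


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Unit-cost Levenshtein distance (assumed; it satisfies the metric axioms)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]


class PivotIndex:
    """Pivot-distance table standing in for the R*-tree (an assumption)."""

    def __init__(self, words: Sequence[str], pivots: Sequence[str]):
        self.words = list(words)
        self.phon = [to_phonemes(w) for w in self.words]
        self.pivots = [to_phonemes(p) for p in pivots]
        # Each word is mapped to its vector of distances to the pivots.
        self.table = [[edit_distance(p, w) for p in self.pivots]
                      for w in self.phon]
        self.last_alignments = 0

    def search(self, query: str, radius: int = 1) -> List[Tuple[str, int]]:
        q = to_phonemes(query)
        dq = [edit_distance(p, q) for p in self.pivots]
        hits = []
        self.last_alignments = 0
        for word, phon, row in zip(self.words, self.phon, self.table):
            # Triangle inequality: |d(q, p) - d(w, p)| <= d(q, w), so a gap
            # larger than the radius rules the word out without aligning it.
            if any(abs(a - b) > radius for a, b in zip(dq, row)):
                continue
            self.last_alignments += 1
            d = edit_distance(q, phon)
            if d <= radius:
                hits.append((word, d))
        return sorted(hits, key=lambda t: t[1])
```

As a usage sketch, `PivotIndex(["damn", "darn", "hello", "help"], pivots=["damn", "hello"]).search("dann")` returns the candidates within one edit of the query, and `last_alignments` records how many full alignments survived pruning. More pivots prune more candidates but add distance computations per query, which is the trade-off behind choosing the number of pivot words.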