Using latent semantic indexing to filter spam

  • Authors: Kevin R. Gee
  • Affiliations: The University of Texas at Arlington, Arlington, TX
  • Venue: Proceedings of the 2003 ACM symposium on Applied computing
  • Year: 2003

Abstract

Past research has explored the effectiveness of a Naïve Bayesian classifier for filtering unsolicited bulk email (spam). Results have shown that the precision of this approach is generally superior to its recall. This study evaluates the effectiveness of a classifier incorporating Latent Semantic Indexing (LSI) to filter spam email on a corpus used in previous studies. Results show that email classifiers using LSI to filter spam achieve very high recall and precision, regardless of whether the corpus is treated with a stop list or a lemmatizer. While LSI yields precision roughly equal to that of a Naïve Bayesian approach, it has substantially higher recall and is more effective under certain conditions. Results show that incorporating LSI into an anti-spam filter is viable, particularly in implementations where misclassified legitimate messages are not arbitrarily deleted. Further inferences are drawn about the applicability of this method to other text mining tasks.
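
For readers unfamiliar with the technique, the following is a minimal sketch of how an LSI-based classifier and the precision and recall measures mentioned above might look. It is not the paper's implementation: the toy corpus, the raw term-count weighting, the choice of k = 2, the nearest-neighbour decision rule by cosine similarity, and all function names are assumptions made purely for illustration.

```python
import numpy as np

# Toy training corpus (hypothetical data, not the corpus used in the paper).
TRAIN = [
    ("win free money now", "spam"),
    ("free money win prize", "spam"),
    ("win prize now free", "spam"),
    ("meeting agenda project report", "legit"),
    ("project report meeting notes", "legit"),
    ("agenda notes project meeting", "legit"),
]

def term_document_matrix(docs, vocab):
    """Rows are terms, columns are documents; entries are raw term counts."""
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for token in doc.lower().split():
            if token in index:
                matrix[index[token], j] += 1.0
    return matrix

def fit_lsi(a, k):
    """Truncated SVD of a: keep only the k largest singular triplets."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :].T  # rows of the last item are document vectors

def fold_in(q, u_k, s_k):
    """Project a term-count vector into the latent space: q^T U_k S_k^{-1}."""
    return (q @ u_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def classify(text, vocab, u_k, s_k, doc_vecs, labels):
    """Assign the label of the most similar training document (assumed rule)."""
    q = term_document_matrix([text], vocab)[:, 0]
    z = fold_in(q, u_k, s_k)
    sims = [cosine(z, d) for d in doc_vecs]
    return labels[int(np.argmax(sims))]

def precision_recall(predicted, actual, positive="spam"):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

texts = [text for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = sorted({token for text in texts for token in text.split()})
u_k, s_k, doc_vecs = fit_lsi(term_document_matrix(texts, vocab), k=2)

print(classify("free prize money", vocab, u_k, s_k, doc_vecs, labels))
print(classify("project meeting notes", vocab, u_k, s_k, doc_vecs, labels))
```

In this sketch, low precision corresponds to legitimate messages being flagged as spam, which is the costly error in filters that delete flagged mail; low recall corresponds to spam reaching the inbox. A real system would likely use a weighting scheme such as tf-idf, a much larger k, and a tuned decision threshold, none of which are specified in the abstract.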