On-line spam filter fusion

  • Authors:
  • Thomas R. Lynam; Gordon V. Cormack; David R. Cheriton

  • Affiliations:
  • University of Waterloo, Waterloo, Ontario, Canada; University of Waterloo, Waterloo, Ontario, Canada; University of Waterloo, Waterloo, Ontario, Canada

  • Venue:
  • SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Year:
  • 2006

Abstract

We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
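The two simplest fusion methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how log-odds are estimated from filter scores, so the `LogOddsEstimator` class, its smoothing constant `eps`, and the counting rule (past spam scoring at or below the current score versus past ham scoring at or above it) are assumptions made here for illustration only. The stacking methods are not shown.

```python
import math
from bisect import bisect_left, bisect_right, insort


def vote_fusion(labels):
    """Average the binary classifications (1 = spam, 0 = ham) of all filters."""
    return sum(labels) / len(labels)


class LogOddsEstimator:
    """Hypothetical on-line log-odds estimate for one filter's raw score.

    Assumption: the estimate compares how many previously seen spam
    messages scored at or below the current score with how many
    previously seen ham messages scored at or above it.
    """

    def __init__(self, eps=1.0):
        self.spam = []  # sorted scores of past messages labeled spam
        self.ham = []   # sorted scores of past messages labeled ham
        self.eps = eps  # smoothing constant (assumed), avoids log(0)

    def log_odds(self, score):
        spam_le = bisect_right(self.spam, score)
        ham_ge = len(self.ham) - bisect_left(self.ham, score)
        return math.log((spam_le + self.eps) / (ham_ge + self.eps))

    def update(self, score, is_spam):
        # Called once the true label is known, mimicking on-line feedback.
        insort(self.spam if is_spam else self.ham, score)


def log_odds_fusion(estimators, scores):
    """Average the per-filter log-odds estimates for one message."""
    total = sum(e.log_odds(s) for e, s in zip(estimators, scores))
    return total / len(scores)
```

In this sketch a message is scored by every filter first, the fused value is computed, and only afterwards is each estimator updated with the true label, so no estimate ever sees the label of the message it is scoring. A positive fused log-odds value leans toward spam and a negative one toward ham.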