On the Validity of a New SMS Spam Collection

  • Authors:
  • Jose Maria Gomez Hidalgo;Tiago A. Almeida;Akebo Yamakami

  • Venue:
  • ICMLA '12 Proceedings of the 2012 11th International Conference on Machine Learning and Applications - Volume 02
  • Year:
  • 2012

Abstract

Mobile phones are becoming the latest target of electronic junk mail. Recent reports clearly indicate that the volume of SMS spam messages is increasing dramatically year by year. One of the major concerns in academic settings has been the scarcity of public SMS spam datasets, which are sorely needed for validating and comparing different classifiers. To address this issue, we have recently proposed a new SMS Spam Collection that, to the best of our knowledge, is the largest public dataset of real SMS messages available for academic studies. However, since it was created by augmenting a previously existing database built from roughly the same sources, it is sensible to verify that the collection contains no duplicates coming from them. In this paper we therefore offer a comprehensive analysis of the new SMS Spam Collection to ensure that it does not, since duplicates may ease the task of learning SMS spam classifiers and hence compromise the evaluation of methods. The analysis indicates that the procedure followed does not lead to near-duplicates and, consequently, that the proposed dataset is reliable for evaluating and comparing the performance of different classifiers.
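
To make the notion of a near-duplicate check concrete, the following is a minimal sketch that flags pairs of messages whose character 3-gram sets have a high Jaccard similarity. It is not the procedure used in the paper, which the abstract does not specify; the function names, the 3-gram size, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-collapsed message."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    if len(normalized) < n:
        # Messages shorter than n characters are kept as a single shingle.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(messages, threshold=0.8, n=3):
    """Return index pairs (i, j), i < j, whose similarity is at least `threshold`.

    The threshold is a hypothetical value, not one taken from the paper.
    This compares all pairs, so it is quadratic in the number of messages.
    """
    sets = [shingles(m, n) for m in messages]
    return [
        (i, j)
        for i, j in combinations(range(len(messages)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```

Calling `near_duplicates` on a merged list of messages from both sources would return the candidate pairs to inspect by hand. For a collection of a few thousand messages the all-pairs comparison is likely acceptable; larger collections would usually call for an approximate scheme such as MinHash.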