Detection of near-duplicate user generated contents: the SMS spam collection

  • Authors:
  • Enrique Vallés;Paolo Rosso

  • Affiliations:
  • Universidad Politécnica de Valencia, Valencia, Spain;Universidad Politécnica de Valencia, Valencia, Spain

  • Venue:
  • Proceedings of the 3rd international workshop on Search and mining user-generated contents
  • Year:
  • 2011

Abstract

The number of spam text messages has grown, mainly because companies are looking for free advertising. It is very important for users to filter these spam messages, which can be viewed as near-duplicate texts because they are mostly created from templates. Identifying spam text messages is a hard and time-consuming task that involves carefully scanning hundreds of text messages. Since near-duplicate detection can be seen as a specific case of plagiarism detection, we investigated whether plagiarism detection tools could be used as filters for spam text messages. We also address the near-duplicate detection problem with a clustering approach using the CLUTO framework. We carried out preliminary experiments on the SMS Spam Collection, which was recently made available for research purposes, and compared the results of the plagiarism detection tools with those obtained with CLUTO. Although plagiarism detection tools detect a good number of near-duplicate SMS spam messages, even better results are obtained with the CLUTO clustering tool.
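
To make the idea of template-based near-duplicates concrete, the following is a minimal sketch of one generic way to flag near-duplicate messages: word-bigram shingling combined with Jaccard similarity. It is not the method used in the paper, and it does not reproduce CLUTO or any plagiarism detection tool. The function names, the 0.5 threshold, and the three sample messages are assumptions made up for illustration; the messages are not taken from the SMS Spam Collection.

```python
import re
from itertools import combinations


def shingles(text, n=2):
    """Return the set of word n-grams of a lowercased message (punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        # Message shorter than n words: treat the whole message as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicate_pairs(messages, threshold=0.5, n=2):
    """Return index pairs (i, j), i < j, whose shingle sets reach the threshold."""
    sets = [shingles(m, n) for m in messages]
    return [
        (i, j)
        for i, j in combinations(range(len(messages)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Invented sample messages (hypothetical, for illustration only).
messages = [
    "You have won a free prize call 0800 now to claim",
    "You have won a free prize call 0900 now to claim",
    "Are we still meeting for lunch tomorrow",
]

print(near_duplicate_pairs(messages))  # [(0, 1)]
```

The first two messages differ only in the phone number, so they share 8 of 12 distinct bigrams and are paired, while the third shares none. Comparing every pair is quadratic in the number of messages, which is acceptable for a small collection but would need an index or a clustering step for larger ones.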