Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Mining the peanut gallery: opinion extraction and semantic classification of product reviews
WWW '03 Proceedings of the 12th international conference on World Wide Web
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Constructing a text corpus for inexact duplicate detection
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Lexicon randomization for near-duplicate detection with I-Match
The Journal of Supercomputing
Computational methods in authorship attribution
Journal of the American Society for Information Science and Technology
Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters
ICMLA '09 Proceedings of the 2009 International Conference on Machine Learning and Applications
Filtering spams using the minimum description length principle
Proceedings of the 2010 ACM Symposium on Applied Computing
Adaptive near-duplicate detection via similarity learning
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Plagiarism detection across distant language pairs
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Fixing the threshold for effective detection of near duplicate web documents in web crawling
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Language Resources and Evaluation
Cross-language plagiarism detection
Language Resources and Evaluation
Finding deceptive opinion spam by any stretch of the imagination
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Contributions to the study of SMS spam filtering: new collection and results
Proceedings of the 11th ACM symposium on Document engineering
PPChecker: plagiarism pattern checker in document copy detection
TSD'06 Proceedings of the 9th international conference on Text, Speech and Dialogue
Overview of the third international workshop on search and mining user-generated contents
Proceedings of the 20th ACM international conference on Information and knowledge management
Detecting near-duplicate documents using sentence-level features and supervised learning
Expert Systems with Applications: An International Journal
Automatically generated spam detection based on sentence-level topic information
Proceedings of the 22nd international conference on World Wide Web companion
Hi-index | 0.01 |
Today, the number of spam text messages has grown in number, mainly because companies are looking for free advertising. For the users is very important to filter these kinds of spam messages that can be viewed as near-duplicate texts because mostly created from templates. The identification of spam text messages is a very hard and time-consuming task and it involves to carefully scanning hundreds of text messages. Therefore, since the task of near-duplicate detection can be seen as a specific case of plagiarism detection, we investigated whether plagiarism detection tools could be used as filters for spam text messages. Moreover we solve the near-duplicate detection problem on the basis of a clustering approach using CLUTO framework. We carried out some preliminary experiments on the SMS Spam Collection that recently was made available for research purposes. The results were compared with the ones obtained with the CLUTO. Althought plagiarism detection tools detect a good number of near-duplicate SMS spam messages even better results are obtained with the CLUTO clustering tool.