Evaluating cost-sensitive Unsolicited Bulk Email categorization
Proceedings of the 2002 ACM symposium on Applied computing
Challenges in web search engines
ACM SIGIR Forum
The connectivity sonar: detecting site functionality by structural patterns
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Identifying link farm spam pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Detecting spam web pages through content analysis
Proceedings of the 15th international conference on World Wide Web
A taxonomy of JavaScript redirection spam
AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
A study of cross-validation and bootstrap for accuracy estimation and model selection
IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2
Detection and analysis of drive-by-download attacks and malicious JavaScript code
Proceedings of the 19th international conference on World wide web
AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
Hi-index | 0.00 |
Web Spam is one of the main difficulties that crawlers have to overcome and therefore one of the main problems of the WWW. There are several studies about characterising and detecting Web Spam pages. However, none of them deals with all the possible kinds of Web Spam. This paper shows an analysis of different kinds of Web Spam pages and identifies new elements that characterise it, to define heuristics which are able to partially detect them. We also discuss and explain several heuristics from the point of view of their effectiveness and computational efficiency. Taking them into account, we study several sets of heuristics and demonstrate how they improve the current results. Finally, we propose a new Web Spam detection system called SAAD (Spam Analyzer And Detector), which is based on the set of proposed heuristics and their use in a C4.5 classifier improved by means of Bagging and Boosting techniques. We have also tested our system in some well known Web Spam datasets and we have found it to be very effective.