C4.5: programs for machine learning
C4.5: programs for machine learning
Machine Learning
Evaluating cost-sensitive Unsolicited Bulk Email categorization
Proceedings of the 2002 ACM symposium on Applied computing
Challenges in web search engines
ACM SIGIR Forum
The connectivity sonar: detecting site functionality by structural patterns
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Identifying link farm spam pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Detecting spam web pages through content analysis
Proceedings of the 15th international conference on World Wide Web
A taxonomy of JavaScript redirection spam
AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Improved use of continuous attributes in C4.5
Journal of Artificial Intelligence Research
A study of cross-validation and bootstrap for accuracy estimation and model selection
IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2
The WEKA data mining software: an update
ACM SIGKDD Explorations Newsletter
Detection and analysis of drive-by-download attacks and malicious JavaScript code
Proceedings of the 19th international conference on World wide web
AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
Foundations and Trends in Information Retrieval
Hi-index | 0.00 |
Web Spam is one of the main difficulties that crawlers have to overcome. According to Gyöngyi and Garcia-Molina it is defined as "any deliberate human action that is meant to trigger an unjustifiably favourable relevance or importance of some web pages considering the pages' true value". There are several studies on characterising and detecting Web Spam pages. However, none of them deals with all the possible kinds of Web Spam. This paper shows an analysis of different kinds of Web Spam pages and identifies new elements that characterise it. Taking them into account, we propose a new Web Spam detection system called SAAD, which is based on a set of heuristics and their use in a C4.5 classifier. Its results are also improved by means of Bagging and Boosting techniques. We have also tested our system in some well-known Web Spam datasets and we have found it to be very effective.