WebBase: a repository of Web pages
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
A vector space model for automatic indexing
Communications of the ACM
Machine Learning
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
An extensive empirical study of feature selection metrics for text classification
The Journal of Machine Learning Research
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Identifying link farm spam pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Detecting spam web pages through content analysis
Proceedings of the 15th international conference on World Wide Web
An introduction to ROC analysis
Pattern Recognition Letters - Special issue: ROC analysis in pattern recognition
A reference collection for web spam
ACM SIGIR Forum
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Know your neighbors: web spam detection using the web topology
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Countering web spam with credibility-based link analysis
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Combating web spam with trustrank
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
The ghost in the browser analysis of web-based malware
HotBots'07 Proceedings of the first conference on First Workshop on Hot Topics in Understanding Botnets
SpyProxy: execution-based detection of malicious web content
SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
Spamscatter: characterizing internet scam hosting infrastructure
SS'07 Proceedings of 16th USENIX Security Symposium on USENIX Security Symposium
A study of cross-validation and bootstrap for accuracy estimation and model selection
IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2
Thwarting the nigritude ultramarine: learning to identify link spam
ECML'05 Proceedings of the 16th European conference on Machine Learning
Web spam detection: new classification features based on qualified link analysis and language models
IEEE Transactions on Information Forensics and Security
Web spam classification: a few features worth more
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Foundations and Trends in Information Retrieval
A social-spam detection framework
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
User-Assisted host-based detection of outbound malware traffic
ICICS'09 Proceedings of the 11th international conference on Information and Communications Security
Survey on web spam detection: principles and algorithms
ACM SIGKDD Explorations Newsletter
Hi-index | 0.00 |
Web spam is a widely-recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expose users to dangerous Web-borne malware. To defend against Web spam, most previous research analyzes the contents of Web pages and the link structure of the Web graph. Unfortunately, these heavyweight approaches require full downloads of both legitimate and spam pages to be effective, making real-time deployment of these techniques infeasible for Web browsers, high-performance Web crawlers, and real-time Web applications. In this paper, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information (i.e., hosting IP addresses and HTTP session headers). Concretely, we built an HTTP session classifier based on our predictive technique, and by incorporating this classifier into HTTP retrieval operations, we are able to detect Web spam pages before the actual content transfer. As a result, our approach protects Web users from Web-propagated malware, and it generates significant bandwidth and storage savings. By applying our predictive technique to a corpus of almost 350,000 Web spam instances and almost 400,000 legitimate instances, we were able to successfully detect 88.2% of the Web spam pages with a false positive rate of only 0.4%. These classification results are superior to previous evaluation results obtained with traditional link-based and content-based techniques. Additionally, our experiments show that our approach saves an average of 15.4 KB of bandwidth and storage resources for every successfully identified Web spam page, while only adding an average of 101 microseconds to each HTTP retrieval operation. Therefore, our predictive technique can be successfully deployed in applications that demand real-time spam detection.