Detection of cloaked web spam by using tag-based methods

Authors:
Jun-Lin Lin
Affiliations:
Department of Information Management, Yuan Ze University, 135 Yuan-Tung Road, Chung-Li 32003, Taiwan
Venue:
Expert Systems with Applications: An International Journal
Year:
2009

Abstract

Web spam attempts to influence search engine ranking algorithms in order to boost the rankings of specific web pages in search engine results. Cloaking is a widely adopted technique for concealing web spam by serving search engine crawlers content that differs from what is displayed in a web browser. Previous work on cloaking detection relies mainly on differences in terms and/or links between multiple copies of a URL retrieved from the web browser and search engine crawler perspectives. This work presents three methods that use differences in tags to determine whether a URL is cloaked. Because the tags of a web page generally do not change as frequently or as significantly as its terms and links, tag-based cloaking detection methods can work more effectively than term- or link-based methods. The proposed methods are tested on a dataset of URLs covering short-, medium-, and long-term user interest. Experimental results indicate that the tag-based methods outperform term- and link-based methods in both precision and recall. Moreover, a Weka J4.8 classifier using a combination of term and tag features yields an accuracy of 90.48%.
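
The general idea of comparing tags across copies of a URL can be illustrated with a minimal sketch. This is not a reproduction of the paper's three methods, which the abstract does not specify. It assumes two copies of a page fetched as a browser and one fetched as a crawler, and it flags the URL when the crawler copy differs in tags from the browser copies by more than the browser copies differ from each other. The names `tag_counts`, `tag_difference`, `looks_cloaked`, and the `margin` parameter are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Counts every start tag seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored on purpose.
        self.tags[tag] += 1


def tag_counts(html):
    """Return a multiset (Counter) of the tags used in `html`."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def tag_difference(a, b):
    """Size of the symmetric difference between two tag multisets."""
    return sum(((a - b) + (b - a)).values())


def looks_cloaked(browser_1, browser_2, crawler, margin=0):
    """Heuristic (an assumption of this sketch, not the paper's rule):
    the browser-vs-browser difference estimates how much the page's tags
    change on their own; a crawler copy that differs by more than that
    baseline plus `margin` is treated as suspicious.
    """
    b1 = tag_counts(browser_1)
    b2 = tag_counts(browser_2)
    c = tag_counts(crawler)
    baseline = tag_difference(b1, b2)
    crawler_gap = min(tag_difference(b1, c), tag_difference(b2, c))
    return crawler_gap > baseline + margin
```

Using the browser-to-browser difference as a baseline reflects the abstract's observation that tags change less often than terms and links, so ordinary content updates between fetches should leave the tag multiset largely unchanged.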