Fighting against web spam: a novel propagation method based on click-through data

Authors:
Chao Wei;Yiqun Liu;Min Zhang;Shaoping Ma;Liyun Ru;Kuo Zhang
Affiliations:
Department of Computer Science and Technology, Tsinghua University, Beijing, China;Department of Computer Science and Technology, Tsinghua University, Beijing, China;Department of Computer Science and Technology, Tsinghua University, Beijing, China;Department of Computer Science and Technology, Tsinghua University, Beijing, China;Department of Computer Science and Technology, Tsinghua University, Beijing, China;Sogou Inc., Beijing, China
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Citing 20
Cited 0

Analysis of a very large web search engine query log

ACM SIGIR Forum
Identifying link farm spam pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Challenges in running a commercial search engine

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Spam: It's Not Just for Inboxes Anymore

Computer
Detecting spam web pages through content analysis

Proceedings of the 15th international conference on World Wide Web
Detecting semantic cloaking on the web

Proceedings of the 15th international conference on World Wide Web
Improving web search ranking by incorporating user behavior information

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A taxonomy of JavaScript redirection spam

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Know your neighbors: web spam detection using the web topology

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Winnowing wheat from the chaff: propagating trust to sift spam from the web

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Combating web spam with trustrank

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Tracking Web spam with HTML style similarities

ACM Transactions on the Web (TWEB)
BrowseRank: letting web users vote for page importance

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Identifying web spam with user behavior analysis

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Cleaning search results using term distance features

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Exploring linguistic features for web spam detection: a preliminary study

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Web spam identification through language model analysis

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Let web spammers expose themselves

Proceedings of the fourth ACM international conference on Web search and data mining
Web spam classification: a few features worth more

Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Adversarial Web Search

Foundations and Trends in Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Combating Web spam is one of the greatest challenges for Web search engines. State-of-the-art anti-spam techniques focus mainly on detecting varieties of spam strategies, such as content spamming and link-based spamming. Although these anti-spam approaches have had much success, they encounter problems when fighting against a continuous barrage of new types of spamming techniques. We attempt to solve the problem from a new perspective, by noticing that queries that are more likely to lead to spam pages/sites have the following characteristics: 1) they are popular or reflect heavy demands for search engine users and 2) there are usually few key resources or authoritative results for them. From these observations, we propose a novel method that is based on click-through data analysis by propagating the spamicity score iteratively between queries and URLs from a few seed pages/sites. Once we obtain the seed pages/sites, we use the link structure of the click-through bipartite graph to discover other pages/sites that are likely to be spam. Experiments show that our algorithm is both efficient and effective in detecting Web spam. Moreover, combining our method with some popular anti-spam techniques such as TrustRank achieves improvement compared with each technique taken individually.