Survey on web spam detection: principles and algorithms

Authors:
Nikita Spirin;Jiawei Han
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
ACM SIGKDD Explorations Newsletter
Year:
2012

Abstract

Search engines have become the de facto starting point for finding information on the Web. Because of web spam, however, search results are not always as good as desired. Moreover, spam evolves, which makes providing high-quality search even more challenging. Over the last decade, research on adversarial information retrieval has attracted considerable interest from both academia and industry. In this paper we present a systematic review of web spam detection techniques, focusing on algorithms and underlying principles. We group existing algorithms into three categories based on the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. We further divide the link-based category into five groups based on the ideas and principles used: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied to web spam detection.