A large-scale study of link spam detection by graph algorithms

Authors:
Hiroo Saito; Masashi Toyoda; Masaru Kitsuregawa; Kazuyuki Aihara
Affiliations:
Aihara Complexity Modelling Project, ERATO, JST, Tokyo, Japan and University of Tokyo, Tokyo, Japan; University of Tokyo, Tokyo, Japan; University of Tokyo, Tokyo, Japan; Aihara Complexity Modelling Project, ERATO, JST, Tokyo, Japan and University of Tokyo, Tokyo, Japan
Venue:
AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Year:
2007



Abstract

Link spam refers to attempts to promote the ranking of spammers' web sites by deceiving link-based ranking algorithms in search engines. Spammers often create densely connected link structures among sites, so-called "link farms". In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCCs). Besides the largest SCC (the core) in the center of the web, we observed that most of the large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, we expand these link farms, used as a reliable spam seed set, by a minimum cut technique that separates links among spam and non-spam sites. We found about 0.6 million spam sites in SCCs around the core, and extracted an additional 8 thousand and 49 thousand spam sites with high precision in the core by the maximal clique enumeration and by the minimum cut technique, respectively.
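
The first two steps of this pipeline can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: it assumes Kosaraju's algorithm for the SCC decomposition and Bron-Kerbosch with pivoting for the clique enumeration, and it assumes that a clique means a set of sites that all link to each other in both directions. The function names, the `min_size` threshold, and the nine-site graph `web` are hypothetical. The minimum cut expansion is left out because the abstract does not say how the flow network is built.

```python
from collections import defaultdict, deque


def strongly_connected_components(graph):
    """Kosaraju's algorithm (iterative). graph: dict site -> set of linked sites."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    reverse = defaultdict(set)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].add(u)

    # Pass 1: record sites in order of DFS completion on the original graph.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(sorted(graph.get(start, ()))))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(sorted(graph.get(nxt, ())))))
                    break
            else:  # all successors visited: this site is finished
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    queue.append(prev)
        components.append(component)
    return components


def maximal_cliques(graph, members, min_size=3):
    """Bron-Kerbosch with pivoting over mutual links among `members` (assumption)."""
    adj = {
        u: {v for v in graph.get(u, ())
            if v in members and v != u and u in graph.get(v, ())}
        for u in members
    }
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) >= min_size:
                cliques.append(clique)
            return
        pivot = max(candidates | excluded, key=lambda n: len(adj[n] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(members), set())
    return cliques


# Hypothetical toy web: a six-site core hiding a mutual-link farm (x, y, z),
# plus a separate farm (s1, s2, s3) that links into the core but gets no link back.
web = {
    "a": {"b"}, "b": {"c"}, "c": {"a", "x"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y", "a"},
    "s1": {"s2", "s3", "a"}, "s2": {"s1", "s3"}, "s3": {"s1", "s2"},
}

components = strongly_connected_components(web)
core = max(components, key=len)
outside = [c for c in components if c is not core and len(c) > 1]
seeds = maximal_cliques(web, core)
```

Here `outside` holds the non-trivial components around the core, which in the study were mostly link farms, and `seeds` holds the cliques inside the core that would serve as starting points for the expansion step. At the scale reported in the abstract, an in-memory dictionary of sets like this would likely not be practical; the sketch only shows the logic.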