The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web
Scaling personalized web search
WWW '03 Proceedings of the 12th international conference on World Wide Web
Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search
IEEE Transactions on Knowledge and Data Engineering
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Detecting spam web pages through content analysis
Proceedings of the 15th international conference on World Wide Web
To randomize or not to randomize: space optimal summaries for hyperlink analysis
Proceedings of the 15th international conference on World Wide Web
Link spam detection based on mass estimation
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Transductive link spam detection
AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Combating web spam with TrustRank
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Challenges in web search engines
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Local computation of PageRank contributions
WAW'07 Proceedings of the 5th international conference on Algorithms and models for the web-graph
Looking into the past to better classify web spam
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Nullification test collections for web spam and SEO
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
A brief survey of computational approaches in social computing
IJCNN'09 Proceedings of the 2009 international joint conference on Neural Networks
False-name-proofness in social networks
WINE'10 Proceedings of the 6th international conference on Internet and network economics
Using patterns in the behavior of the random surfer to detect webspam beneficiaries
WISS'10 Proceedings of the 2010 international conference on Web information systems engineering
Webspam demotion: Low complexity node aggregation methods
Neurocomputing
The Laplacian paradigm: emerging algorithms for massive graphs
TAMC'10 Proceedings of the 7th annual conference on Theory and Applications of Models of Computation
Reliability prediction of webpages in the medical domain
ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
Querying provenance for ranking and recommending
TaPP'12 Proceedings of the 4th USENIX conference on Theory and Practice of Provenance
Detecting Webspam Beneficiaries Using Information Collected by the Random Surfer
International Journal of Organizational and Collective Intelligence
Since the link structure of the web is an important element in search engines' ranking systems, web spammers widely manipulate it to increase the rank of their pages. Various link-based features of web pages have been introduced and have proven effective at identifying link spam. One particularly successful family of features (as described in the SpamRank algorithm) is based on examining the sets of pages that contribute most to the PageRank of a given vertex, called supporting sets. In a recent paper, the current authors described an algorithm for efficiently computing, for a single specified vertex, an approximation of its supporting sets. In this paper, we describe several link-based spam-detection features, both supervised and unsupervised, that can be derived from these approximate supporting sets. In particular, we examine the size of a node's supporting sets and the approximate ℓ2 norm of the PageRank contributions from other nodes. As a supervised feature, we examine the composition of a node's supporting sets. We perform experiments on two labeled real data sets to demonstrate the effectiveness of these features for spam detection, and we show that these features can be computed efficiently. Furthermore, we design a variation of PageRank (called Robust PageRank) that incorporates some of these features into its ranking, argue that this variation is more robust against link-spam engineering, and give an algorithm for approximating Robust PageRank.
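To make the feature definitions concrete, the sketch below computes exact PageRank contributions on a toy graph and derives the two unsupervised features the abstract mentions: supporting-set size and the ℓ2 norm of a node's contribution vector. This is a minimal illustration, not the authors' efficient local approximation algorithm; the graph, the threshold `delta`, and the normalization by the target's PageRank are assumptions made for the example. It uses the identity that global PageRank with uniform teleportation is the average of the personalized PageRank vectors, so the contribution of u to v is ppr_u(v)/n.

```python
import numpy as np

def personalized_pagerank(P, source, alpha=0.15, iters=200):
    """Power iteration for personalized PageRank restarting at `source`.
    P is a row-stochastic transition matrix (no dangling nodes)."""
    n = P.shape[0]
    e = np.zeros(n)
    e[source] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = alpha * e + (1 - alpha) * (p @ P)
    return p

# Hypothetical toy graph; every node has at least one out-link.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 2), (3, 0)]
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = 1.0
P = A / A.sum(axis=1, keepdims=True)

# Contribution matrix: C[u, v] = ppr_u(v) / n.
# Columns then sum to the global PageRank vector.
C = np.array([personalized_pagerank(P, u) for u in range(n)]) / n
pagerank = C.sum(axis=0)

target = 2
contribs = C[:, target]

# Feature 1 (unsupervised): size of the delta-supporting set of `target`,
# i.e. nodes contributing at least delta * PageRank(target).
delta = 0.1  # illustrative threshold, not from the paper
support = np.flatnonzero(contribs >= delta * pagerank[target])

# Feature 2 (unsupervised): l2 norm of the contribution vector,
# normalized by the target's PageRank. Spam targets whose rank is
# propped up by many small contributions tend to have a low value.
l2_feature = np.linalg.norm(contribs) / pagerank[target]

print("supporting-set size:", len(support))
print("normalized l2 norm:", round(l2_feature, 3))
```

The supervised feature described in the abstract (supporting-set composition) would additionally require labels for the nodes in `support`, e.g. the fraction of known-spam pages among them.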