Web spam detection: new classification features based on qualified link analysis and language models

Authors:
Lourdes Araujo;Juan Martinez-Romo
Affiliations:
NLP & IR Group, UNED, Madrid, Spain;NLP & IR Group, UNED, Madrid, Spain
Venue:
IEEE Transactions on Information Forensics and Security
Year:
2010

Abstract

Web spam is a serious problem for search engines because the quality of their results can be severely degraded by the presence of this kind of page. In this paper, we present an efficient spam detection system based on a classifier that combines new link-based features with language-model (LM)-based ones. These features are related not only to quantitative data extracted from the Web pages, but also to qualitative properties, mainly of the page links. We consider, for instance, the ability of a search engine to find, using information provided by the page for a given link, the page that the link actually points to. This can be regarded as an indicator of the link's reliability. We also check the coherence between a page and another one pointed to by any of its links. Two pages connected by a hyperlink should be semantically related by at least a weak contextual relation. Thus, we apply an LM approach to different sources of information from a Web page that belongs to the context of a link, in order to provide high-quality indicators of Web spam. Specifically, we apply the Kullback-Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages. The result is a system that significantly improves the detection of Web spam using fewer features on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
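As a rough illustration of the Kullback-Leibler step described in the abstract, the sketch below compares two text sources taken from either end of a link, for example the anchor text on the source page and the title of the target page. It is not the authors' implementation. The unigram model, the additive smoothing, the whitespace tokenization, and the names `unigram_lm`, `kl_divergence`, and `link_divergence` are all assumptions made for this example.

```python
import math
from collections import Counter


def unigram_lm(text, vocab, alpha=1.0):
    """Build an additively smoothed unigram distribution over `vocab`.

    Assumption: whitespace tokenization and add-alpha smoothing; the
    abstract does not specify the language model used in the paper.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def link_divergence(source_text, target_text, alpha=1.0):
    """Divergence between a source-side text and a target-side text.

    Hypothetical feature: a larger value suggests the two ends of the
    link are less related in vocabulary.
    """
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_lm(source_text, vocab, alpha)
    q = unigram_lm(target_text, vocab, alpha)
    return kl_divergence(p, q)


if __name__ == "__main__":
    related = link_divergence("cheap pills online", "cheap pills store")
    unrelated = link_divergence("cheap pills online", "university research library")
    print(f"related pair:   {related:.3f}")
    print(f"unrelated pair: {unrelated:.3f}")
```

In a classifier of the kind the abstract describes, one such value per combination of sources could serve as a feature. Smoothing keeps the divergence finite when a word appears on only one side of the link.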