Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Authors:
Dennis Fetterly;Mark Manasse;Marc Najork
Affiliations:
Microsoft Research, Mountain View, CA;Microsoft Research, Mountain View, CA;Microsoft Research, Mountain View, CA
Venue:
Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Year:
2004

Citing 9
Cited 93

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Graph structure in the Web

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Who Links to Whom: Mining Linkage between Web Sites

ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
A large-scale study of the evolution of web pages

WWW '03 Proceedings of the 12th international conference on World Wide Web
Efficient URL caching for world wide web crawling

WWW '03 Proceedings of the 12th international conference on World Wide Web
Challenges in web search engines

ACM SIGIR Forum
The connectivity sonar: detecting site functionality by structural patterns

Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress

Identifying link farm spam pages

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Crawling a country: better strategies than breadth-first for web page ordering

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Detecting phrase-level duplication on the world wide web

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Link spam alliances

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Discovering large dense subgraphs in massive graphs

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Spam: It's Not Just for Inboxes Anymore

Computer
Topical TrustRank: using topicality to combat web spam

Proceedings of the 15th international conference on World Wide Web
Site level noise removal for search engines

Proceedings of the 15th international conference on World Wide Web
Detecting spam web pages through content analysis

Proceedings of the 15th international conference on World Wide Web
Detecting semantic cloaking on the web

Proceedings of the 15th international conference on World Wide Web
Undue influence: eliminating the impact of link plagiarism on web search rankings

Proceedings of the 2006 ACM symposium on Applied computing
Report on the 7th Workshop on Distributed Data and Structures: (WDAS 2006)

ACM SIGMOD Record
Evaluation of crawling policies for a web-repository crawler

Proceedings of the seventeenth conference on Hypertext and hypermedia
Link spam detection based on mass estimation

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Lazy preservation: reconstructing websites by crawling the crawlers

WIDM '06 Proceedings of the 8th annual ACM international workshop on Web information and data management
Multi-level Link Structure Analysis Technique for Detecting Link Farm Spam Pages

WI-IATW '06 Proceedings of the 2006 IEEE/WIC/ACM international conference on Web Intelligence and Intelligent Agent Technology
Web Dragons: Inside the Myths of Search Engine Technology

Web Dragons: Inside the Myths of Search Engine Technology
Web searching, search engines and Information Retrieval

Information Services and Use
Characterization of national Web domains

ACM Transactions on Internet Technology (TOIT)
Spam double-funnel: connecting web spammers with advertisers

Proceedings of the 16th international conference on World Wide Web
Improving web spam classification using rank-time features

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Improving web spam classifiers using link structure

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Using spam farm to boost PageRank

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Extracting link spam using biased random walks from spam seed sets

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Measuring similarity to detect qualified links

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
A taxonomy of JavaScript redirection spam

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Web spam detection via commercial intent analysis

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Link analysis for Web spam detection

ACM Transactions on the Web (TWEB)
Tracking Web spam with HTML style similarities

ACM Transactions on the Web (TWEB)
Detecting splogs via temporal dynamics using self-similarity analysis

ACM Transactions on the Web (TWEB)
DirichletRank: Solving the zero-one gap problem of PageRank

ACM Transactions on Information Systems (TOIS)
Analyzing the impact of churn and malicious behavior on the quality of peer-to-peer web search

Proceedings of the 2008 ACM symposium on Applied computing
Improving web information indexing and retrieval based on center block duplication detection

International Journal of Innovative Computing and Applications
Efficient semi-streaming algorithms for local triangle counting in massive graphs

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Identifying Spam Web Pages Based on Content Similarity

ICCSA '08 Proceedings of the international conference on Computational Science and Its Applications, Part II
A large-scale study of automated web search traffic

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Identifying web spam with user behavior analysis

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Exploring linguistic features for web spam detection: a preliminary study

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Latent dirichlet allocation in web spam filtering

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Identifying video spammers in online social networks

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Robust PageRank and locally computable spam detection features

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Predicting web spam with HTTP session information

Proceedings of the 17th ACM conference on Information and knowledge management
Improvements of HITS Algorithms for Spam Links

IEICE - Transactions on Information and Systems
Sitemaps: above and beyond the crawl of duty

Proceedings of the 18th international conference on World wide web
A study of link farm distribution and evolution using a time series of web snapshots

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Web spam filtering in internet archives

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Web spam identification through language model analysis

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Linked latent Dirichlet allocation in web spam filtering

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Nullification test collections for web spam and SEO

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Detecting Link Hijacking by Web Spammers

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Ranking billions of web pages using diodes

Communications of the ACM - A Blind Person's Interaction with Technology
Link spam target detection using page farms

ACM Transactions on Knowledge Discovery from Data (TKDD)
A framework for describing web repositories

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Detecting spammers and content promoters in online video social networks

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Detecting spam blogs: a machine learning approach

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
A comparison of fraud cues and classification methods for fake escrow website detection

Information Technology and Management
CUCWeb: a Catalan corpus built from the web

WAC '06 Proceedings of the 2nd International Workshop on Web as Corpus
Web Crawling

Foundations and Trends in Information Retrieval
Improvements of HITS algorithms for spam links

APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
Identifying spam link generators for monitoring emerging web spam

Proceedings of the 4th workshop on Information credibility
Local computation of PageRank contributions

WAW'07 Proceedings of the 5th international conference on Algorithms and models for the web-graph
Using evidence based content trust model for spam detection

Expert Systems with Applications: An International Journal
Connectivity of the Thai web graph

APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
On the robustness of google scholar against spam

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Efficient algorithms for large-scale local triangle counting

ACM Transactions on Knowledge Discovery from Data (TKDD)
Temporal query log profiling to improve web search ranking

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Spam detection with a content-based random-walk algorithm

SMUC '10 Proceedings of the 2nd international workshop on Search and mining user-generated contents
Automatic checking of alternative texts on web pages

ICCHP'10 Proceedings of the 12th international conference on Computers helping people with special needs: Part I
Let web spammers expose themselves

Proceedings of the fourth ACM international conference on Web search and data mining
Removing web spam links from search engine results

Journal in Computer Virology
The dark side of the Internet: Attacks, costs and responses

Information Systems
Detecting spam blogs from blog search results

Information Processing and Management: an International Journal
Filtering artificial texts with statistical machine learning techniques

Language Resources and Evaluation
Spam detection in online classified advertisements

Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Adversarial Web Search

Foundations and Trends in Information Retrieval
Detecting fake websites: the contribution of statistical learning theory

MIS Quarterly
Combining textual content and hyperlinks in web spam detection

NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
deSEO: combating search-result poisoning

SEC'11 Proceedings of the 20th USENIX conference on Security
Sampling the national deep web

DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I
Web Spam Detection by Exploring Densely Connected Subgraphs

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
On the utility of incremental feature selection for the classification of textual data streams

PCI'05 Proceedings of the 10th Panhellenic conference on Advances in Informatics
Identifying Web Spam with the Wisdom of the Crowds

ACM Transactions on the Web (TWEB)
Thwarting the nigritude ultramarine: learning to identify link spam

ECML'05 Proceedings of the 16th European conference on Machine Learning
Survey on web spam detection: principles and algorithms

ACM SIGKDD Explorations Newsletter
Content-based analysis to detect Arabic web spam

Journal of Information Science
Analysis and detection of web spam by means of web content

IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval
Detecting Fake Medical Web Sites Using Recursive Trust Labeling

ACM Transactions on Information Systems (TOIS)
Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Ranking document clusters using markov random fields

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Cross-lingual web spam classification

Proceedings of the 22nd international conference on World Wide Web companion
SAAD, a content based Web Spam Analyzer and Detector

Journal of Systems and Software
Campaign extraction from social media

ACM Transactions on Intelligent Systems and Technology (TIST) - Special Section on Intelligent Mobile Knowledge Discovery and Management Systems and Special Issue on Social Web Mining
Solving graph data issues using a layered architecture approach with applications to web spam detection

Neural Networks

Abstract

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.

We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.

This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
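
The general idea of flagging outliers in the distribution of a page property can be illustrated with a minimal sketch. The abstract does not state which statistical test the authors used, so the modified z-score based on the median absolute deviation shown here is an assumption chosen for illustration. The function name `flag_outliers`, the threshold of 3.5, and the sample out-degree values are likewise hypothetical and are not taken from the paper.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    Illustrative only: the modified z-score (median / MAD) is a swapped-in
    technique, not the method described in the paper.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)  # median absolute deviation
    if mad == 0:
        # No spread around the median: flag anything that differs from it.
        return [i for i, d in enumerate(abs_dev) if d > 0]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical per-page out-degree counts (made-up data for illustration).
out_degrees = [12, 9, 15, 11, 10, 14, 13, 950, 8, 12]
suspects = flag_outliers(out_degrees)
print(suspects)  # indices of pages whose out-degree is a statistical outlier
```

A flagged index is only a candidate for inspection. The abstract claims a high correlation between such outliers and spam, not a guarantee, so a sketch like this would at most feed a later manual or automated review.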