Detecting spam web pages through content analysis

Authors:
Alexandros Ntoulas;Marc Najork;Mark Manasse;Dennis Fetterly
Affiliations:
UCLA Computer Science Dept., Los Angeles, CA;Microsoft Research, Mountain View, CA;Microsoft Research, Mountain View, CA;Microsoft Research, Mountain View, CA
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006

Abstract

In this paper, we continue our investigations of "web spam": the injection of artificially created pages into the web in order to influence search engine results and drive traffic to certain pages for fun or profit. We consider some previously undescribed techniques for automatically detecting spam pages, and examine the effectiveness of these techniques both in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages in our judged collection of 17,168 pages (spam makes up 13.8% of the collection), while misidentifying 526 spam and non-spam pages (3.1% of the collection).
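
The percentages quoted in the abstract can be re-derived from the raw counts. The short Python sketch below does that arithmetic; the variable and function names are illustrative and not taken from the paper. The split of the 526 misidentified pages into missed spam and wrongly flagged non-spam is an assumption (that the 526 errors are exactly those two groups), so the derived precision should be read as an estimate rather than a reported result.

```python
# Counts quoted in the abstract.
total_pages = 17_168      # judged collection
spam_pages = 2_364        # pages judged to be spam
spam_detected = 2_037     # spam pages correctly identified
misclassified = 526       # spam and non-spam pages misidentified


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


recall = pct(spam_detected, spam_pages)        # share of spam that was caught
spam_share = pct(spam_pages, total_pages)      # share of the collection that is spam
error_rate = pct(misclassified, total_pages)   # share of all pages misidentified

# Assumption: the 526 errors are missed spam plus non-spam flagged as spam.
missed_spam = spam_pages - spam_detected
false_positives = misclassified - missed_spam
estimated_precision = pct(spam_detected, spam_detected + false_positives)

print(recall, spam_share, error_rate)
print(missed_spam, false_positives, estimated_precision)
```

Under that assumption, the counts imply that most of the errors are spam pages that went undetected rather than legitimate pages flagged as spam.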