Tracking Web spam with HTML style similarities

Authors:
Tanguy Urvoy;Emmanuel Chauveau;Pascal Filoche;Thomas Lavergne
Affiliations:
Orange Labs (France Telecom R&D), Lannion cedex, France;Orange Labs (France Telecom R&D), Lannion cedex, France;Orange Labs (France Telecom R&D), Lannion cedex, France;Orange Labs and ENST Paris, Lannion cedex, France
Venue:
ACM Transactions on the Web (TWEB)
Year:
2008

Abstract

Automatically generated content is ubiquitous on the web. Dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs, and other sites edited using web authoring software), as are less legitimate spamdexing attempts (e.g., link farms, fake directories). Pages built with the same generating method (template or script) share a common “look and feel” that is not easily detected by common text classification methods and is more closely related to stylometry. In this work, we study and compare several HTML style similarity measures based on both textual and extra-textual features of HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Because the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique. We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. Finally, we present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.
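
To make the idea of grouping pages by HTML style with LSH more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the features or the hash family, so the sketch assumes tag-name shingles as the "extra-textual" feature and uses minhash with banding as one well-known LSH scheme. All function names (`style_tokens`, `shingles`, `minhash`, `estimated_similarity`, `lsh_groups`) and parameters (`k`, `num_hashes`, `bands`) are hypothetical choices made for illustration.

```python
import hashlib
import re
from collections import defaultdict


def style_tokens(html):
    """Keep only tag names (opening and closing); visible text is discarded.
    Assumption: tag structure is a rough proxy for the page's 'style'."""
    return re.findall(r"<\s*(/?[a-zA-Z][a-zA-Z0-9]*)", html)


def shingles(tokens, k=3):
    """Return the set of runs of k consecutive tag names."""
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle, salted with a seed."""
    data = f"{seed}:{' '.join(shingle)}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set, num_hashes=64):
    """Minhash signature: the minimum hash value under each seeded hash."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_groups(signatures, bands=16):
    """Banding: pages whose signatures agree on a whole band share a bucket.
    Returns the distinct groups of page names that collide at least once."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    return {frozenset(names) for names in buckets.values() if len(names) > 1}


if __name__ == "__main__":
    pages = {
        "a": "<html><body><div><h1>Cheap pills</h1><p>buy now</p></div></body></html>",
        "b": "<html><body><div><h1>Free loans</h1><p>click here</p></div></body></html>",
        "c": "<html><head><title>x</title></head><body><table><tr><td>y</td></tr></table></body></html>",
    }
    sigs = {name: minhash(shingles(style_tokens(html))) for name, html in pages.items()}
    print(estimated_similarity(sigs["a"], sigs["b"]))
    print(lsh_groups(sigs))
```

In this toy input, pages `a` and `b` differ only in their visible text, so their tag shingles coincide and they fall into the same buckets, while `c` uses a different layout. In a real setting, the choice of features, shingle length, and band count would need to be tuned against labeled data such as the dataset named in the abstract.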