Automatically generated content is ubiquitous on the web. Dynamic sites built on the three-tier paradigm are good examples (e.g., commercial sites, blogs, and other sites edited with web authoring software), as are less legitimate spamdexing attempts (e.g., link farms, fake directories). Pages built with the same generating method (template or script) share a common “look and feel” that common text classification methods do not easily detect, and that is more closely related to stylometry. In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features of the HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality-sensitive hashing (LSH), we first review this technique. We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.
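To make the idea of LSH-based style clustering more concrete, the following is a minimal sketch and not the algorithm evaluated in this work. It assumes one possible extra-textual feature (the characters left after deleting alphanumeric runs from the HTML source), character 4-gram shingles, MinHash signatures, and a banding scheme to collect candidate groups. All function names and parameter values (`style_shingles`, `minhash`, `lsh_candidate_groups`, `n=4`, 32 hash functions, 8 bands) are hypothetical choices made for illustration.

```python
import hashlib
import re
from collections import defaultdict


def style_shingles(html, n=4):
    """Assumed style feature: character n-grams of the non-alphanumeric residue."""
    noise = re.sub(r"[A-Za-z0-9]+", "", html)  # drop words, tag names, numbers
    noise = re.sub(r"\s+", " ", noise)         # collapse whitespace runs
    if len(noise) < n:
        return {noise}
    return {noise[i:i + n] for i in range(len(noise) - n + 1)}


def _hash(seed, token):
    """Stable 64-bit hash of a token; the seed selects one hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(shingles, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded function."""
    return tuple(min(_hash(seed, s) for s in shingles) for seed in range(num_hashes))


def jaccard(a, b):
    """Exact Jaccard similarity, which MinHash agreement rates approximate."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def lsh_candidate_groups(signatures, bands=8):
    """Group documents whose signatures agree on every row of at least one band."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    return {frozenset(members) for members in buckets.values() if len(members) > 1}


# Toy pages: "a" and "b" share a template but differ in visible text.
pages = {
    "a": "<html><body><div class='item'><h1>Cheap pills</h1><p>Buy now!</p></div></body></html>",
    "b": "<html><body><div class='item'><h1>Fast loans</h1><p>Apply today!</p></div></body></html>",
    "c": "<table><tr><td>Hello</td></tr></table>",
}
shingle_sets = {name: style_shingles(html) for name, html in pages.items()}
signatures = {name: minhash(s) for name, s in shingle_sets.items()}
groups = lsh_candidate_groups(signatures)
print(sorted(sorted(g) for g in groups))
```

In this toy input, pages "a" and "b" reduce to the same non-alphanumeric residue, so they receive identical signatures and fall into the same bucket in every band. Under this standard banding scheme, a larger number of bands makes the grouping more permissive, while more rows per band makes it stricter. On a real collection, candidate groups would still need to be verified against an exact similarity measure.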