Web crawlers are automated tools that browse the web to retrieve and analyze information. Although crawlers are useful tools that help users find content on the web, they may also be malicious. Unfortunately, unauthorized (malicious) crawlers are increasingly becoming a threat to service providers because they typically collect information that attackers can abuse for spamming, phishing, or targeted attacks. In particular, social networking sites are frequent targets of malicious crawling, and there have been recent cases of scraped data being sold on the black market and used for blackmailing. In this paper, we introduce PUBCRAWL, a novel approach for the detection and containment of crawlers. Our detection is based on the observation that crawler traffic significantly differs from user traffic, even when many users are hidden behind a single proxy. Moreover, we present the first technique for crawler campaign attribution that discovers synchronized traffic coming from multiple hosts. Finally, we introduce a containment strategy that leverages our detection results to efficiently block crawlers while minimizing the impact on legitimate users. Our experimental results in a large, well-known social networking site (receiving tens of millions of requests per day) demonstrate that PUBCRAWL can distinguish between crawlers and users with high accuracy. We have completed our technology transfer, and the social networking site is currently running PUBCRAWL in production.
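One way to picture the kind of traffic difference the abstract alludes to is timing regularity: automated crawlers often issue requests at machine-like, evenly spaced intervals, while human browsing is bursty. The sketch below is purely illustrative and is not the paper's actual feature set or classifier; the `looks_like_crawler` function and its `cv_threshold` parameter are hypothetical, using the coefficient of variation of inter-arrival times as a stand-in signal.

```python
from statistics import mean, stdev

def interarrival_cv(timestamps):
    """Coefficient of variation of inter-arrival times (in seconds).

    Values near 0 indicate machine-like regularity; human browsing
    tends to produce burstier gaps and a much higher value.
    Returns None if there are too few requests to judge."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return None
    m = mean(gaps)
    return stdev(gaps) / m if m > 0 else None

def looks_like_crawler(timestamps, cv_threshold=0.3):
    # Hypothetical threshold, chosen only for this illustration.
    cv = interarrival_cv(timestamps)
    return cv is not None and cv < cv_threshold

# A stream of requests every 2 seconds looks automated;
# an irregular, bursty stream looks human.
print(looks_like_crawler([0, 2, 4, 6, 8, 10]))   # regular -> True
print(looks_like_crawler([0, 1, 7, 9, 30, 31]))  # bursty  -> False
```

A real system would of course combine many such per-source signals (and, as the abstract notes, must cope with proxies that aggregate many users behind one IP address, which this toy heuristic does not handle).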