Discovery of Web Robot Sessions Based on their Navigational Patterns

Authors:
Pang-Ning Tan;Vipin Kumar
Affiliations:
Department of Computer Science, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, USA. ptan@cs.umn.edu;Department of Computer Science, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, USA. kumar@cs.umn.edu
Venue:
Data Mining and Knowledge Discovery
Year:
2002

Abstract

Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by Web robots and distinguish them from visits by other users. First, e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their Web sites. In addition, Web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to Web robots also make it more difficult to perform clickstream analysis effectively on the Web data. Conventional techniques for detecting Web robots are often based on identifying the IP address and user agent of the Web clients. While these techniques are applicable to many well-known robots, they may not be sufficient to detect camouflaged and previously unknown robots. In this paper, we propose an alternative approach that uses the navigational patterns in the clickstream data to determine whether a session is due to a robot. Experimental results on our Computer Science department Web server logs show that highly accurate classification models can be built using this approach. We also show that these models are able to discover many camouflaged and previously unidentified robots.
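
To make the idea of classifying sessions by navigational pattern more concrete, the following is a minimal Python sketch of how per-session features might be derived from server log entries. It is illustrative only and is not the authors' implementation: the `Request` record, the 30-minute `SESSION_TIMEOUT`, and the particular features (a `/robots.txt` request, share of image requests, share of requests without a referrer, mean gap between requests) are all assumptions chosen for the example, since the abstract does not list the features used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Request:
    """One parsed log entry (hypothetical record layout)."""
    ip: str
    user_agent: str
    timestamp: float  # seconds since some epoch
    path: str
    referrer: str     # "" or "-" when the client sent none


SESSION_TIMEOUT = 1800.0  # assumed inactivity cutoff, in seconds
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def sessionize(requests: List[Request]) -> List[List[Request]]:
    """Group requests into sessions keyed by (ip, user_agent).

    A new session starts when the same client has been idle for longer
    than SESSION_TIMEOUT.
    """
    sessions: List[List[Request]] = []
    latest: Dict[Tuple[str, str], int] = {}  # client -> index of its open session
    for req in sorted(requests, key=lambda r: r.timestamp):
        key = (req.ip, req.user_agent)
        idx = latest.get(key)
        if idx is None or req.timestamp - sessions[idx][-1].timestamp > SESSION_TIMEOUT:
            sessions.append([req])
            latest[key] = len(sessions) - 1
        else:
            sessions[idx].append(req)
    return sessions


def session_features(session: List[Request]) -> Dict[str, float]:
    """Compute a few illustrative navigational features for one session."""
    n = len(session)
    gaps = [b.timestamp - a.timestamp for a, b in zip(session, session[1:])]
    return {
        "n_requests": n,
        "robots_txt": any(r.path == "/robots.txt" for r in session),
        "frac_images": sum(r.path.lower().endswith(IMAGE_EXTENSIONS) for r in session) / n,
        "frac_no_referrer": sum(r.referrer in ("", "-") for r in session) / n,
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }
```

Feature vectors of this kind, paired with labels for known robot and non-robot sessions, could then be passed to any standard classifier. Unlike a lookup on IP address or user agent, such features depend on how the client navigates, which is why they may still flag a robot that misreports its identity.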