Access patterns for robots and humans in web archives

Authors:
Yasmin A. AlNoamany; Michele C. Weigle; Michael L. Nelson
Affiliations:
Old Dominion University, Norfolk, VA, USA; Old Dominion University, Norfolk, VA, USA; Old Dominion University, Norfolk, VA, USA
Venue:
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Year:
2013

Abstract

Although user access patterns on the live web are well understood, there has been no corresponding study of how users, both humans and robots, access web archives. Based on samples from the Internet Archive's public Wayback Machine, we propose a set of basic usage patterns: Dip (a single access), Slide (the same page at different archive times), Dive (different pages at approximately the same archive time), and Skim (lists of what pages are archived, i.e., TimeMaps). Robots are limited almost exclusively to Dips and Skims, whereas human accesses are spread across all four types. Robots outnumber humans 10:1 in terms of sessions, 5:4 in terms of raw HTTP accesses, and 4:1 in terms of megabytes transferred. Robots almost always access TimeMaps (95% of accesses), but humans predominantly access the archived web pages themselves (82% of accesses). In terms of unique archived web pages, there is no overall preference for a particular time, but the recent past (within the last year) shows significant repeat accesses.
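
The four patterns defined above can be illustrated with a minimal Python sketch. It is not the authors' method: the function name `classify_session`, the two path formats (`/web/<14-digit timestamp>/<URL>` for an archived page and `/web/timemap/<format>/<URL>` for a TimeMap), and the one-day window standing in for "approximately the same archive time" are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed path format for an archived page: /web/<14-digit timestamp>/<original URL>
MEMENTO = re.compile(r"^/web/(\d{14})/(.+)$")
# Assumed path format for a TimeMap: /web/timemap/<format>/<original URL>
TIMEMAP = re.compile(r"^/web/timemap/[^/]+/(.+)$")


def classify_session(paths, same_time=timedelta(days=1)):
    """Return the set of patterns seen in one session (a list of request paths).

    The one-day `same_time` window is an illustrative assumption, not a
    threshold taken from the paper.
    """
    patterns = set()
    mementos = []  # (archive datetime, original URL) for each archived-page request
    for path in paths:
        if TIMEMAP.match(path):
            patterns.add("Skim")  # a request for the list of archived copies
            continue
        m = MEMENTO.match(path)
        if m:
            when = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
            mementos.append((when, m.group(2)))

    # Dip: the whole session is a single archived-page access.
    if len(paths) == 1 and len(mementos) == 1:
        patterns.add("Dip")

    # Compare consecutive archived-page requests.
    for (t1, u1), (t2, u2) in zip(mementos, mementos[1:]):
        if u1 == u2 and t1 != t2:
            patterns.add("Slide")  # same page, different archive times
        elif u1 != u2 and abs(t1 - t2) <= same_time:
            patterns.add("Dive")  # different pages, roughly the same archive time
    return patterns


if __name__ == "__main__":
    session = [
        "/web/timemap/link/http://example.com/",
        "/web/20120101000000/http://example.com/",
        "/web/20130101000000/http://example.com/",
    ]
    print(sorted(classify_session(session)))
```

A session can show more than one pattern, so the sketch returns a set rather than a single label. Separating robot sessions from human ones is a different problem that this sketch does not attempt.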