Hidden-Web induced by client-side scripting: an empirical study

Authors: Zahra Behfarshad; Ali Mesbah
Affiliation: University of British Columbia, Vancouver, BC, Canada (both authors)
Venue: ICWE '13: Proceedings of the 13th International Conference on Web Engineering
Year: 2013

Abstract

Client-side JavaScript is increasingly used to enhance the functionality, interactivity, and responsiveness of web applications. Through the execution of JavaScript code in browsers, the DOM tree representing a webpage at runtime can be incrementally updated without requiring a URL change. This dynamically updated content is hidden from general search engines. In this paper, we present the first empirical study to measure and characterize the hidden-web content induced by client-side JavaScript execution. Our study reveals that this type of hidden-web content is prevalent in online web applications today: of the 500 websites we analyzed, 95% contain client-side hidden-web content. On those websites, (1) on average, 62% of the web states are hidden; (2) each hidden state holds an average of 19 kilobytes of hidden data, of which 0.6 kilobytes is textual content; (3) the DIV element is the most common clickable element (61%) used to initiate this type of hidden-web state transition; and (4) on average, 25 minutes are required to dynamically crawl 50 DOM states. Further, our study indicates that there is a correlation between DOM tree size and hidden-web content, but no correlation between the amount of JavaScript code and client-side hidden-web content.
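To make the notion of a client-side hidden state concrete, the following is a minimal sketch, not the authors' tooling. It assumes a crawler has already recorded each click as a transition and treats the resulting state as hidden when the DOM changed but the URL did not. The `Transition` record, its field names, the helper functions, and the sample data are all hypothetical and chosen only for illustration; the paper's actual classification criteria may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One recorded crawler step (hypothetical record, not from the paper)."""
    element_tag: str   # tag of the clicked element, e.g. "DIV" or "A"
    url_before: str    # URL before the click
    url_after: str     # URL after the click
    dom_changed: bool  # whether the DOM tree differs after the click


def is_hidden(t: Transition) -> bool:
    """Assumption: a state is hidden if the DOM changed while the URL stayed the same."""
    return t.dom_changed and t.url_before == t.url_after


def hidden_ratio(transitions) -> float:
    """Fraction of DOM-changing transitions whose resulting state is hidden."""
    changed = [t for t in transitions if t.dom_changed]
    if not changed:
        return 0.0
    return sum(is_hidden(t) for t in changed) / len(changed)


def hidden_tag_counts(transitions) -> Counter:
    """Count which clicked element tags initiate hidden-state transitions."""
    return Counter(t.element_tag.upper() for t in transitions if is_hidden(t))


# Made-up sample data for illustration only.
HOME = "http://example.com/"
SAMPLE = [
    Transition("DIV", HOME, HOME, True),             # hidden
    Transition("DIV", HOME, HOME, True),             # hidden
    Transition("A", HOME, HOME + "about", True),     # visible: URL changed
    Transition("SPAN", HOME, HOME, True),            # hidden
    Transition("A", HOME, HOME, False),              # no DOM change, ignored
]

if __name__ == "__main__":
    print(f"hidden ratio: {hidden_ratio(SAMPLE):.2f}")
    print(hidden_tag_counts(SAMPLE).most_common())
```

Comparing full URLs is a deliberate simplification: an application that encodes state in the URL fragment would be counted as changing the URL here, which a real study might treat differently.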