The OXPath to success in the deep web

Authors:
Andrew Jon Sellers
Affiliations:
University of Oxford, Oxford, United Kingdom
Venue:
Proceedings of the 20th international conference companion on World wide web
Year:
2011

Abstract

The World Wide Web provides access to a wealth of data. Collecting and maintaining data at this scale requires automated extraction, since many extraction tasks would otherwise be infeasible. Modern web interfaces, however, are designed primarily for human users and deliver sophisticated interactions through client-side scripting and asynchronous server communication, which makes them hard to process automatically. To bridge this gap, we introduce OXPath, a careful extension of XPath that facilitates data extraction from the deep web. By building on XPath's familiarity and theoretical foundations, OXPath achieves favourable evaluation complexity and optimal page buffering, storing only a constant number of pages for non-recursive queries. Further, OXPath provides a lightweight interface that is easy to use and embed. This paper outlines the motivation, theoretical framework, current implementation, and preliminary results. We conclude with proposed future work on OXPath, including an investigation of how to deploy OXPath efficiently in a highly elastic (cloud) computing framework.