OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Authors:
Tim Furche; Georg Gottlob; Giovanni Grasso; Christian Schallhart; Andrew Sellers
Affiliations:
Department of Computer Science, Oxford University, Oxford, UK OX1 3QD (all authors)
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2013

Abstract

The evolution of the web has outpaced itself: a growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath, an extension of XPath for interacting with web applications and extracting the data thus revealed, which meets all of the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, while evaluation time remains polynomial. We experimentally validate the theoretical complexity and demonstrate that OXPath's resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
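
To illustrate the idea behind page-at-a-time evaluation, the following is a minimal Python sketch. It is not OXPath syntax and not the authors' algorithm. It only shows the general pattern the abstract describes: follow navigation links while holding a single page in memory and emit extracted records as a stream. The canned `PAGES` site, the `fetch` stand-in for browser rendering, and the function `extract_page_at_a_time` are all hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

# Hypothetical three-page "site" (an assumption for illustration only):
# each page lists some items and may link to a following page.
PAGES = {
    "p1": "<html><body><ul><li>alpha</li><li>beta</li></ul>"
          "<a id='next' href='p2'>Next</a></body></html>",
    "p2": "<html><body><ul><li>gamma</li></ul>"
          "<a id='next' href='p3'>Next</a></body></html>",
    "p3": "<html><body><ul><li>delta</li><li>epsilon</li></ul></body></html>",
}


def fetch(url: str) -> ET.Element:
    """Stand-in for rendering a page in a browser: parses a canned document."""
    return ET.fromstring(PAGES[url])


def extract_page_at_a_time(
    start: str, fetch_page: Callable[[str], ET.Element] = fetch
) -> Iterator[str]:
    """Yield item texts from a chain of pages, keeping only one page alive."""
    url: Optional[str] = start
    while url is not None:
        page = fetch_page(url)  # the only page tree referenced at this point
        # Stream out the data on the current page before navigating away.
        for item in page.iterfind(".//li"):
            yield item.text or ""
        # Locate the navigation link with a plain XPath-style expression.
        link = page.find(".//a[@id='next']")
        url = link.get("href") if link is not None else None
        # Rebinding `page` on the next iteration releases the previous tree,
        # so memory does not grow with the number of pages visited.


if __name__ == "__main__":
    for record in extract_page_at_a_time("p1"):
        print(record)
```

Because the generator yields records as soon as they are found and keeps no history of earlier pages, its memory footprint in this toy setting depends on the largest single page rather than on the length of the crawl.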