The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting the data thus revealed, matching all four requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath's resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.
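The memory claim for page-at-a-time evaluation can be illustrated with a small Python sketch. This is not OXPath's evaluator: `SITE`, `PageBuffer`, and `extract` are hypothetical names, the "site" is a dictionary rather than a rendered browser page, and only a linear chain of "next" links is followed. It shows the general idea only: records are emitted while a page is loaded, and the page is released before the next one is fetched, so the number of pages held at once does not grow with the number of pages visited.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical in-memory "site": page id -> (records on the page, id of the next page).
SITE: Dict[int, Tuple[List[str], Optional[int]]] = {
    1: (["a", "b"], 2),
    2: (["c"], 3),
    3: (["d", "e"], None),
}


class PageBuffer:
    """Counts how many pages are held at once (a stand-in for browser memory)."""

    def __init__(self) -> None:
        self.live = 0  # pages currently loaded
        self.peak = 0  # largest number of pages loaded at the same time

    def load(self, page_id: int) -> Tuple[List[str], Optional[int]]:
        self.live += 1
        self.peak = max(self.peak, self.live)
        return SITE[page_id]

    def release(self) -> None:
        self.live -= 1


def extract(start: int, buf: PageBuffer) -> Iterator[str]:
    """Yield records one page at a time, releasing each page before loading the next."""
    page_id: Optional[int] = start
    while page_id is not None:
        records, next_id = buf.load(page_id)
        for record in records:
            yield record   # emit results immediately instead of accumulating pages
        buf.release()      # the page is no longer needed once its records are out
        page_id = next_id


buf = PageBuffer()
results = list(extract(1, buf))
print(results, "peak pages held:", buf.peak)
```

In this sketch the peak number of loaded pages stays at one however long the chain of pages is; only the extracted records accumulate, and a caller could stream those to disk instead of collecting them in a list.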