How the Minotaur Turned into Ariadne: Ontologies in Web Data Extraction

Authors:
Tim Furche;Georg Gottlob;Xiaonan Guo;Christian Schallhart;Andrew Sellers;Cheng Wang
Affiliations:
Department of Computer Science, University of Oxford, UK;Department of Computer Science, University of Oxford, UK;Department of Computer Science, University of Oxford, UK;Department of Computer Science, University of Oxford, UK;Department of Computer Science, University of Oxford, UK;Department of Computer Science, University of Oxford, UK
Venue:
ICWE'11 Proceedings of the 11th international conference on Web engineering
Year:
2011

Abstract

Humans require automated support to profit from the wealth of data now available on the web. To that end, the linked open data initiative and others have been asking data providers to publish structured, semantically annotated data. Small data providers, such as most UK real-estate agencies, are overburdened by this task, however: many are only just starting to move from simple, table- or list-like directories to web applications with rich interfaces. We argue that fully automated extraction of structured data can help resolve this dilemma. Ironically, automated data extraction has seen a recent revival thanks to the use of ontologies and linked open data to guide the extraction. First results from the DIADEM project illustrate that high-quality, fully automated data extraction at web scale is possible if we combine domain ontologies with a phenomenology describing the representation of domain concepts. We briefly summarise the DIADEM project and discuss a few preliminary results.