DIADEM: domain-centric, intelligent, automated data extraction methodology

Authors:
Tim Furche;Georg Gottlob;Giovanni Grasso;Omer Gunes;Xiaonan Guo;Andrey Kravchenko;Giorgio Orsi;Christian Schallhart;Andrew Sellers;Cheng Wang
Affiliations:
Oxford University, Oxford, United Kingdom (all authors)
Venue:
Proceedings of the 21st international conference companion on World Wide Web
Year:
2012


Abstract

Search engines are the sinews of the web. These sinews have become strained, however: where the web's function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics: a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers, never sure that a better offer isn't just around the corner. What search engines are missing is an understanding of the objects and their attributes published on websites. Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy, provided that we supply it with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we show with a first prototype of DIADEM that, in contrast to the alchemists, DIADEM has developed a viable formula.