DIADEM: domain-centric, intelligent, automated data extraction methodology

  • Authors:
  • Tim Furche; Georg Gottlob; Giovanni Grasso; Omer Gunes; Xiaonan Guo; Andrey Kravchenko; Giorgio Orsi; Christian Schallhart; Andrew Sellers; Cheng Wang

  • Affiliations:
  • Oxford University, Oxford, United Kingdom (all authors)

  • Venue:
  • Proceedings of the 21st international conference companion on World Wide Web
  • Year:
  • 2012

Abstract

Search engines are the sinews of the web. These sinews have become strained, however: where the web's function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics: a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers, never sure that a better offer isn't just around the corner. What search engines are missing is an understanding of the objects and their attributes published on websites. Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near-perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy, but for this task we need to provide it with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we show with a first prototype of DIADEM that, in contrast to the alchemists, DIADEM has developed a viable formula.