Automated ontology instantiation from tabular web sources: The AllRight system

  • Authors:
  • Dietmar Jannach; Kostyantyn Shchekotykhin; Gerhard Friedrich

  • Affiliations:
  • Technische Universität Dortmund, 44221 Dortmund, Germany; University of Klagenfurt, 9020 Klagenfurt, Austria; University of Klagenfurt, 9020 Klagenfurt, Austria

  • Venue:
  • Web Semantics: Science, Services and Agents on the World Wide Web
  • Year:
  • 2009

Abstract

The process of populating an ontology-based system with high-quality and up-to-date instance information can be both time-consuming and prone to error. In many domains, however, one possible solution to this problem is to automate the instantiation process for a given ontology by searching (mining) the web for the required instance information. The primary challenges facing such a system include: (a) efficiently locating web pages that most probably contain the desired instance information, (b) extracting the instance information from a page, and (c) clustering documents that describe the same instance in order to exploit data redundancy on the web and thus improve the overall quality of the harvested data. In addition, these steps should require as little seed knowledge as possible. In this paper, the AllRight ontology instantiation system is presented, which supports the full instantiation life-cycle and addresses the above-mentioned challenges through a combination of new and existing techniques. In particular, the system was designed to deal with situations where the instance information is given in tabular form. The main innovative pillars of the system are a new high-recall focused crawling technique (xCrawl), a novel table recognition algorithm, innovative methods for document clustering and instance name recognition, as well as techniques for fact extraction, instance generation and query-based fact validation. The evaluation of the system in different real-world application scenarios shows that the ontology instantiation process can be successfully automated using only a very limited amount of seed knowledge.