Automated ontology instantiation from tabular web sources: The AllRight system

Authors:
Dietmar Jannach;Kostyantyn Shchekotykhin;Gerhard Friedrich
Affiliations:
Technische Universität Dortmund, 44221 Dortmund, Germany;University of Klagenfurt, 9020 Klagenfurt, Austria;University of Klagenfurt, 9020 Klagenfurt, Austria
Venue:
Web Semantics: Science, Services and Agents on the World Wide Web
Year:
2009

Abstract

The process of populating an ontology-based system with high-quality and up-to-date instance information can be both time-consuming and prone to error. In many domains, however, one possible solution to this problem is to automate the instantiation process for a given ontology by searching (mining) the web for the required instance information. The primary challenges facing such a system include: (a) efficiently locating web pages that are most likely to contain the desired instance information, (b) extracting the instance information from a page, and (c) clustering documents that describe the same instance in order to exploit data redundancy on the web and thus improve the overall quality of the harvested data. In addition, these steps should require as little seed knowledge as possible. In this paper, the AllRight ontology instantiation system is presented, which supports the full instantiation life-cycle and addresses the above-mentioned challenges through a combination of new and existing techniques. In particular, the system was designed to deal with situations where the instance information is given in tabular form. The main innovative pillars of the system are a new high-recall focused crawling technique (xCrawl), a novel table recognition algorithm, innovative methods for document clustering and instance name recognition, as well as techniques for fact extraction, instance generation and query-based fact validation. An evaluation of the system in several real-world application scenarios shows that the ontology instantiation process can be automated successfully using only a very limited amount of seed knowledge.
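
To illustrate challenge (c), the following is a minimal sketch of how redundancy across pages might be exploited once facts have been extracted. It is not the AllRight clustering or validation method, which the abstract does not specify. The function names (normalize, merge_by_vote), the name-normalization rule, the majority-vote strategy, and the sample product data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def normalize(name):
    """Crude instance-name key (assumption): lowercase, letters and digits only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def merge_by_vote(extractions):
    """Group (instance_name, {attribute: value}) pairs by normalized name
    and keep the most frequent value for each attribute."""
    votes = defaultdict(lambda: defaultdict(Counter))
    for name, facts in extractions:
        key = normalize(name)
        for attribute, value in facts.items():
            # Values are compared case-insensitively after trimming whitespace.
            votes[key][attribute][value.strip().lower()] += 1
    return {
        key: {attr: counts.most_common(1)[0][0] for attr, counts in attrs.items()}
        for key, attrs in votes.items()
    }


# Hypothetical extractions from three pages describing the same product.
pages = [
    ("Acme X100", {"resolution": "12 MP", "weight": "475 g"}),
    ("acme x-100", {"resolution": "12 MP", "weight": "524 g"}),
    ("ACME X100", {"resolution": "12 mp", "weight": "475 g"}),
]
merged = merge_by_vote(pages)
print(merged)
```

In this sketch, the three spellings of the product name collapse to a single key, and the weight reported by only one page is outvoted by the value on which two pages agree. A real system would need a more robust way to decide that two documents describe the same instance than plain string normalization.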