Populating an ontology-based system with high-quality, up-to-date instance information can be both time-consuming and error-prone. In many domains, however, one possible solution is to automate the instantiation process for a given ontology by searching (mining) the web for the required instance information. The primary challenges facing such a system include: (a) efficiently locating the web pages most likely to contain the desired instance information, (b) extracting the instance information from a page, and (c) clustering documents that describe the same instance in order to exploit data redundancy on the web and thus improve the overall quality of the harvested data. In addition, these steps should require as little seed knowledge as possible. This paper presents the AllRight ontology instantiation system, which supports the full instantiation life-cycle and addresses the above-mentioned challenges through a combination of new and existing techniques. In particular, the system was designed to deal with situations where the instance information is given in tabular form. The main innovative pillars of the system are a new high-recall focused crawling technique (xCrawl), a novel table recognition algorithm, innovative methods for document clustering and instance name recognition, as well as techniques for fact extraction, instance generation, and query-based fact validation. The evaluation of the system in different real-world application scenarios shows that the ontology instantiation process can be successfully automated using only a very limited amount of seed knowledge.
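To make challenge (c) more concrete, the following is a minimal Python sketch of how redundancy across pages could be exploited once attribute-value tables have been extracted. It is an illustration only, not the AllRight implementation: the function names `normalize_name` and `merge_by_majority`, the simple name normalization, the majority-vote rule, and the sample data are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def normalize_name(name: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def merge_by_majority(documents):
    """Group per-page attribute tables by instance name and keep, for each
    attribute, the value reported by the largest number of pages.

    `documents` is an iterable of (instance_name, {attribute: value}) pairs,
    standing in for the output of the extraction step.
    """
    # votes[instance][attribute] counts how often each value was seen.
    votes = defaultdict(lambda: defaultdict(Counter))
    for instance_name, facts in documents:
        key = normalize_name(instance_name)
        for attribute, value in facts.items():
            votes[key][attribute][value] += 1

    merged = {}
    for key, attributes in votes.items():
        merged[key] = {
            attribute: counter.most_common(1)[0][0]
            for attribute, counter in attributes.items()
        }
    return merged


# Made-up example: three pages describe the same (fictional) product,
# and one of them disagrees on the weight.
pages = [
    ("Camera A", {"resolution": "10 MP", "weight": "500 g"}),
    ("camera  a", {"resolution": "10 MP", "weight": "550 g"}),
    ("Camera A", {"resolution": "10 MP", "weight": "500 g"}),
]
instances = merge_by_majority(pages)
```

In this sketch a value that appears on only one page is outvoted by the value that two pages agree on, which is the basic sense in which clustering documents about the same instance can raise the quality of the harvested data. A real system would need a more robust way of deciding that two pages describe the same instance than exact matching on a normalized name.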