Corroborate and learn facts from the web

Authors:
Shubin Zhao; Jonathan Betz
Affiliations:
Google Inc., New York, NY; Google Inc., New York, NY
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

The web contains a wealth of factual information about entities such as celebrities, movies, and products. This paper describes a robust bootstrapping approach that corroborates known facts and learns new facts at the same time. The approach starts by retrieving, for each entity in the seed set, the relevant pages from a crawl repository. In each learning cycle, the known facts of an entity are first corroborated against a relevant page to find fact mentions. Any fact mentions found serve as examples for learning new facts from the same page via HTML pattern discovery. Newly extracted facts are added to the known fact set for the next learning cycle, and the bootstrapping process continues until no new facts can be learned. The approach is language-independent. It performed well in an experiment on country facts. Results of a large-scale experiment, with initial facts imported from Wikipedia, are also presented.
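
The learning cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (find_mention_patterns, extract_with_patterns, bootstrap), the pattern shape (an attribute and a value separated only by HTML tags), and the toy pages for the made-up entity "Exampleland" are all assumptions chosen to keep the example small. A dictionary of in-memory page strings stands in for the crawl repository.

```python
import re
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str]          # (attribute, value)
Pattern = Tuple[str, str, str]  # (tag before attribute, tags between, tag after value)


def find_mention_patterns(page: str, facts: Set[Fact]) -> Set[Pattern]:
    """Corroborate known facts on a page and return the HTML context of each mention.

    Assumption: a mention is an attribute and a value separated only by tags.
    """
    patterns: Set[Pattern] = set()
    for attr, value in facts:
        mention = re.search(
            r"(<[^<>]+>)\s*" + re.escape(attr)
            + r"\s*((?:</?[^<>]+>\s*)+)" + re.escape(value)
            + r"\s*(</[^<>]+>)",
            page,
        )
        if mention:
            prefix, middle, suffix = mention.groups()
            patterns.add((prefix, middle.strip(), suffix))
    return patterns


def extract_with_patterns(page: str, patterns: Set[Pattern]) -> Set[Fact]:
    """Apply each discovered HTML pattern to the page to extract candidate facts."""
    found: Set[Fact] = set()
    for prefix, middle, suffix in patterns:
        regex = (re.escape(prefix) + r"\s*([^<>]+?)\s*" + re.escape(middle)
                 + r"\s*([^<>]+?)\s*" + re.escape(suffix))
        for match in re.finditer(regex, page):
            found.add((match.group(1), match.group(2)))
    return found


def bootstrap(pages: Dict[str, List[str]],
              seeds: Dict[str, Set[Fact]]) -> Dict[str, Set[Fact]]:
    """Repeat corroborate-then-learn cycles until no new facts are learned."""
    known = {entity: set(facts) for entity, facts in seeds.items()}
    changed = True
    while changed:
        changed = False
        for entity, facts in known.items():
            for page in pages.get(entity, []):
                patterns = find_mention_patterns(page, facts)
                new_facts = extract_with_patterns(page, patterns) - facts
                if new_facts:
                    facts |= new_facts
                    changed = True
    return known


# Toy data (entirely made up) standing in for crawled pages about one entity.
list_page = ("<ul><li><b>Currency</b><span>Example dollar</span></li>"
             "<li><b>Area</b><span>1000 km2</span></li></ul>")
table_page = ("<table>\n"
              "<tr><td>Capital</td><td>Example City</td></tr>\n"
              "<tr><td>Currency</td><td>Example dollar</td></tr>\n"
              "</table>")
pages = {"Exampleland": [list_page, table_page]}
seeds = {"Exampleland": {("Capital", "Example City")}}

result = bootstrap(pages, seeds)
print(sorted(result["Exampleland"]))
```

In this toy run the seed fact is corroborated only on the table page, which yields the currency fact. In the next cycle that fact is corroborated on the list page, which yields the area fact. A further cycle learns nothing new and the loop stops. A real system would also need to filter noisy extractions, which this sketch does not attempt.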