Methods for domain-independent information extraction from the web: an experimental comparison

Authors:
Oren Etzioni; Michael Cafarella; Doug Downey; Ana-Maria Popescu; Tal Shaked; Stephen Soderland; Daniel S. Weld; Alexander Yates
Affiliations:
Department of Computer Science and Engineering, University of Washington, Seattle, WA (all authors)
Venue:
AAAI'04 Proceedings of the 19th national conference on Artificial intelligence
Year:
2004

Abstract

Our KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an autonomous, domain-independent, and scalable manner. In its first major run, KNOWITALL extracted over 50,000 facts with high precision, but the run also raised a challenge: how can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Rule Learning learns domain-specific extraction rules. Subclass Extraction automatically identifies subclasses in order to boost recall. List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, no hand-labeled training examples are required. Experiments show the relative coverage of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 19-fold increase in recall, while maintaining high precision, and discovered 10,300 cities missing from the Tipster Gazetteer.
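
The abstract describes List Extraction only at a high level: find a list that contains known class instances, learn a "wrapper" for it, and extract the remaining elements. The Python sketch below illustrates that general idea and is not KNOWITALL's actual algorithm. It assumes a wrapper is simply the longest left and right context shared by every seed occurrence on a page. The function names `learn_wrapper` and `apply_wrapper` and the sample page are hypothetical.

```python
import re
from os.path import commonprefix

# Hypothetical sample page: a list of cities, two of which are already known.
PAGE = "\n".join([
    "<h1>Cities</h1>",
    "<ul>",
    "<li><b>Paris</b> (France)</li>",
    "<li><b>Seattle</b> (USA)</li>",
    "<li><b>Nairobi</b> (Kenya)</li>",
    "<li><b>Lima</b> (Peru)</li>",
    "</ul>",
])


def learn_wrapper(html, seeds):
    """Return the (left, right) context shared by all seeds, or None."""
    lefts, rights = [], []
    for seed in seeds:
        pos = html.find(seed)
        if pos < 0:
            return None  # a seed is absent, so this page yields no wrapper
        lefts.append(html[:pos])
        rights.append(html[pos + len(seed):])
    # Longest common suffix of the left contexts: reverse each string,
    # take the common prefix, then reverse the result back.
    left = commonprefix([s[::-1] for s in lefts])[::-1]
    # Longest common prefix of the right contexts.
    right = commonprefix(rights)
    if not left or not right:
        return None
    return left, right


def apply_wrapper(html, wrapper):
    """Extract every tag-free string that sits between the two delimiters."""
    left, right = wrapper
    # The lookahead leaves the right delimiter unconsumed, so adjacent list
    # items whose delimiters overlap are still matched.
    pattern = re.escape(left) + r"([^<>]+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, html)


if __name__ == "__main__":
    seeds = ["Paris", "Seattle"]
    wrapper = learn_wrapper(PAGE, seeds)
    found = apply_wrapper(PAGE, wrapper)
    print([name for name in found if name not in seeds])  # ['Nairobi', 'Lima']
```

Two seeds are enough here because the list markup is regular. On real pages a wrapper learned this way can be too specific or too general, so extracted candidates would still need a separate precision check.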