Unsupervised named-entity extraction from the web: an experimental study

Authors:
Oren Etzioni;Michael Cafarella;Doug Downey;Ana-Maria Popescu;Tal Shaked;Stephen Soderland;Daniel S. Weld;Alexander Yates
Affiliations:
Department of Computer Science and Engineering, University of Washington, Seattle, WA;Department of Computer Science and Engineering, University of Washington, Seattle, WA;Department of Computer Science and Engineering, University of Washington, Seattle, WA;Department of Computer Science and Engineering, University of Washington, Seattle, WA;Department of Computer Science and Engineering, University of Washington, Seattle, WA;Department of Computer Science and Engineering, University of Washington, Seattle, WA;Department of Computer Science and Engineering, University of Washington, Seattle, WA;Department of Computer Science and Engineering, University of Washington, Seattle, WA
Venue:
Artificial Intelligence
Year:
2005

Abstract

The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision?

This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
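
To make the List Extraction idea concrete, the following is a minimal toy sketch, not KNOWITALL's actual wrapper learner. It assumes a "wrapper" is simply the pair of HTML tags immediately surrounding a few already-known seed instances, and that every list element shares that same pair. The function names `seed_context`, `learn_wrapper`, and `extract_list`, as well as the sample page, are hypothetical and chosen only for illustration.

```python
import re
from typing import List, Optional, Tuple


def seed_context(html: str, seed: str) -> Optional[Tuple[str, str]]:
    """Return the text from the nearest '<' before the seed up to the seed,
    and from the end of the seed through the next '>' (assumed context)."""
    pos = html.find(seed)
    if pos == -1:
        return None
    start = html.rfind("<", 0, pos)
    end = html.find(">", pos + len(seed))
    if start == -1 or end == -1:
        return None
    return html[start:pos], html[pos + len(seed):end + 1]


def learn_wrapper(html: str, seeds: List[str]) -> Optional[Tuple[str, str]]:
    """Learn a (prefix, suffix) wrapper only if all seeds share one context."""
    contexts = {seed_context(html, seed) for seed in seeds}
    if len(contexts) != 1 or None in contexts:
        return None
    return contexts.pop()


def extract_list(html: str, seeds: List[str]) -> List[str]:
    """Apply the learned wrapper to pull out every element of the list."""
    wrapper = learn_wrapper(html, seeds)
    if wrapper is None:
        return []
    prefix, suffix = wrapper
    # Capture tag-free text that sits between the learned prefix and suffix.
    pattern = re.escape(prefix) + r"([^<>]+?)" + re.escape(suffix)
    return re.findall(pattern, html)


page = "<ul><li>Paris</li><li>Berlin</li><li>Madrid</li><li>Oslo</li></ul>"
cities = extract_list(page, ["Paris", "Berlin"])
print(cities)
```

Here two seed cities are enough to recover the remaining list elements, which illustrates how instances found by one method could bootstrap further extractions without hand-labeled examples. A real system would also need to score the new candidates before accepting them, which this sketch omits.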