WebSets: extracting sets of entities from the web using unsupervised information extraction

Authors:
Bhavana Bharat Dalvi; William W. Cohen; Jamie Callan
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA; Carnegie Mellon University, Pittsburgh, PA, USA; Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the fifth ACM international conference on Web search and data mining
Year:
2012

Abstract

We describe an open-domain information extraction method for extracting concept-instance pairs from an HTML corpus. Most earlier approaches to this problem rely on combining clusters of distributionally similar terms and concept-instance pairs obtained with Hearst patterns. In contrast, our method relies on a novel approach for clustering terms found in HTML tables, and then assigning concept names to these clusters using Hearst patterns. The method can be efficiently applied to a large corpus, and experimental results on several datasets show that our method can accurately extract large numbers of concept-instance pairs.
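
To make the labeling step concrete, the following is a minimal sketch of how a cluster of terms might be assigned a concept name from Hearst-pattern matches. It is an illustration under stated assumptions, not the method from the paper: the clusters are assumed to be given already (the table-based clustering is not shown), only one simplified "X such as A, B and C" pattern is used, and the names `hearst_pairs` and `label_cluster` as well as the majority-vote rule are hypothetical.

```python
import re
from collections import Counter

# Simplified, assumed form of a single Hearst pattern:
# "<concept> such as <instance>, <instance> and <instance>"
SUCH_AS = re.compile(r"\b(\w+) such as ([\w ,]+)")


def hearst_pairs(text):
    """Yield (concept, instance) pairs found by the 'such as' pattern."""
    for match in SUCH_AS.finditer(text.lower()):
        concept = match.group(1)
        # Split the instance list on commas and the conjunctions "and" / "or".
        for instance in re.split(r",|\band\b|\bor\b", match.group(2)):
            instance = instance.strip()
            if instance:
                yield concept, instance


def label_cluster(cluster, pairs):
    """Return the concept that co-occurs most often with the cluster's terms.

    Majority voting is an assumption made for this sketch; returns None
    when no term in the cluster appears in any extracted pair.
    """
    votes = Counter(concept for concept, instance in pairs if instance in cluster)
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    text = (
        "Countries such as India, China and Japan. "
        "Cities such as Pittsburgh and Boston."
    )
    pairs = list(hearst_pairs(text))
    print(label_cluster({"india", "china", "japan"}, pairs))
    print(label_cluster({"pittsburgh", "boston"}, pairs))
```

In this sketch each extracted pair casts one vote for its concept, so a cluster takes the label its members are most often introduced with in running text. A realistic system would likely need more patterns, normalization of plural concept names, and some way to handle ties and noisy matches.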