Scalable knowledge harvesting with high precision and high recall

Authors:
Ndapandula Nakashole;Martin Theobald;Gerhard Weikum
Affiliations:
Max Planck Institute for Informatics, Saarbrucken, Germany;Max Planck Institute for Informatics, Saarbrucken, Germany;Max Planck Institute for Informatics, Saarbrucken, Germany
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Abstract

Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data. This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of ngram-itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypotheses space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.
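
To make the idea of ngram-itemsets and pattern-occurrence statistics more concrete, the following is a minimal Python sketch, not the PROSPERA implementation. It assumes that a pattern is the set of word n-grams found in the text between two entity mentions, that two patterns are compared by Jaccard overlap, and that a pattern's weight is the fraction of its occurrences that co-occur with known-good entity pairs. The function names, the choice of n = 3, the Jaccard measure, and the seed and counter-seed sets are all illustrative assumptions that the abstract does not specify.

```python
from collections import Counter


def ngram_itemset(phrase, n=3):
    """Return the set of word n-grams in a phrase (assumed pattern representation)."""
    tokens = phrase.lower().split()
    if not tokens:
        return frozenset()
    if len(tokens) < n:
        # Phrase shorter than n words: keep it as a single item.
        return frozenset([tuple(tokens)])
    return frozenset(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def itemset_similarity(a, b):
    """Jaccard overlap of two n-gram itemsets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pattern_confidence(occurrences, seeds, counter_seeds):
    """Estimate a weight per pattern from occurrence statistics.

    occurrences   : iterable of (pattern_id, (entity1, entity2)) observations
    seeds         : entity pairs assumed to be true for the relation
    counter_seeds : entity pairs assumed to be false for the relation
    Occurrences with pairs in neither set are ignored.
    """
    positive, negative = Counter(), Counter()
    for pattern, pair in occurrences:
        if pair in seeds:
            positive[pattern] += 1
        elif pair in counter_seeds:
            negative[pattern] += 1
    return {
        p: positive[p] / (positive[p] + negative[p])
        for p in set(positive) | set(negative)
    }


# Hypothetical example data, invented for illustration only.
a = ngram_itemset("graduated under the supervision of")
b = ngram_itemset("received his doctorate under the supervision of")
similarity = itemset_similarity(a, b)

observations = [
    ("p1", ("A", "B")),
    ("p1", ("C", "D")),
    ("p2", ("A", "B")),
    ("p1", ("X", "Y")),  # pair in neither set, so it is ignored
]
weights = pattern_confidence(observations, seeds={("A", "B")}, counter_seeds={("C", "D")})
```

In this sketch the two phrases share the trigrams "under the supervision" and "the supervision of", so they partially match even though their wording differs, which is the kind of flexibility a set-of-n-grams pattern offers over an exact string pattern. Weights derived this way could serve both to discard low-confidence patterns and to weight clauses passed to a reasoner, in the spirit of the two uses the abstract names.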