Large-scale cross-document coreference using distributed inference and hierarchical models

Authors:
Sameer Singh;Amarnag Subramanya;Fernando Pereira;Andrew McCallum
Affiliations:
University of Massachusetts, Amherst MA;Google Research, Mountain View CA;Google Research, Mountain View CA;University of Massachusetts, Amherst MA
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Year:
2011

Abstract

Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction. For large collections, it is impractical to consider all possible groupings of mentions into distinct entities. To address this, we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large-scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference. To evaluate these ideas, we constructed a labeled corpus of 1.5 million disambiguated mentions in Web pages by selecting link anchors referring to Wikipedia entities. We show that the combination of the hierarchical model with distributed inference quickly obtains high accuracy (with an error reduction of 38%) on this large dataset, demonstrating the scalability of our approach.
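
To make the impracticality claim concrete: the number of ways to group n mentions into entities is the number of set partitions of n items, the Bell number B(n). The sketch below is an illustration only and is not taken from the paper; the function name `bell_numbers` is hypothetical. It computes B(n) with the Bell triangle to show how quickly the space of groupings grows.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], the number of ways to partition
    0..n_max mentions into entities, computed with the Bell triangle."""
    bells = [1]   # B(0) = 1: the empty set has exactly one partition
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last element of the previous row.
        new_row = [row[-1]]
        # Each further element is its left neighbor plus the element above it.
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    counts = bell_numbers(20)
    for n in (5, 10, 20):
        print(f"{n} mentions -> {counts[n]} possible groupings")
```

Even a few dozen mentions already yield far more groupings than can be enumerated, which is why approximate inference is needed for a corpus of 1.5 million mentions.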