Entity clustering across languages

Authors:
Spence Green; Nicholas Andrews; Matthew R. Gormley; Mark Dredze; Christopher D. Manning
Affiliations:
Stanford University; Johns Hopkins University; Johns Hopkins University; Johns Hopkins University; Stanford University
Venue:
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Year:
2012


Abstract

Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions appear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text mentions across documents and languages simultaneously, producing cross-lingual entity clusters. Our approach extends standard clustering algorithms with cross-lingual mention and context similarity measures. Crucially, we do not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven different text genres, our best model yields a 24.3% F1 gain over the baseline.
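
The abstract does not specify the similarity measures or the clustering algorithm, so the following is only an illustrative sketch of the general idea (combining a mention-string similarity with a context similarity and clustering on the combined score), not the paper's method. It assumes that mention strings have already been romanized and that context words have already been mapped into a shared vocabulary; the function names, the `weight` and `threshold` values, and the toy data are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def mention_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two romanized mention strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words context vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(mentions, weight=0.5, threshold=0.6):
    """Single-link clustering: merge any pair whose combined score >= threshold.

    `weight` and `threshold` are arbitrary illustrative values, not tuned ones.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            (name_i, ctx_i), (name_j, ctx_j) = mentions[i], mentions[j]
            score = (weight * mention_similarity(name_i, name_j)
                     + (1 - weight) * context_similarity(ctx_i, ctx_j))
            if score >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(mentions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data (hypothetical): each mention is (romanized name, context word counts).
mentions = [
    ("Muammar Gaddafi", Counter({"libya": 2, "leader": 1})),
    ("Moammar Qaddafi", Counter({"libya": 1, "leader": 1, "tripoli": 1})),
    ("Barack Obama", Counter({"washington": 1, "president": 1})),
]

print(cluster(mentions))  # [[0, 1], [2]]
```

The two spelling variants share most of their characters and context words, so they fall into one cluster, while the third mention has no context overlap and stays separate. No entity list is consulted at any point, which mirrors the abstract's setting in which no knowledge base is assumed.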