Name phylogeny: a generative model of string variation

Authors:
Nicholas Andrews;Jason Eisner;Mark Dredze
Affiliations:
Johns Hopkins University, Baltimore, MD;Johns Hopkins University, Baltimore, MD;Johns Hopkins University, Baltimore, MD
Venue:
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Year:
2012

Abstract

Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
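
For orientation, the untrained baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of unit-cost Levenshtein distance used to rank candidate variants of a test name; it is not the paper's learned transducer or its variational EM procedure. The function names (`levenshtein`, `nearest_variants`), the parameter `k`, and the example names are assumptions made for this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between a and b."""
    # prev[j] holds the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def nearest_variants(test_name: str, collection: list[str], k: int = 3) -> list[str]:
    """Return the k strings in collection closest to test_name (hypothetical helper).

    Ties keep their original order because Python's sort is stable.
    """
    return sorted(collection, key=lambda s: levenshtein(test_name, s))[:k]


if __name__ == "__main__":
    # Illustrative names only; not data from the paper.
    names = ["Barak Obama", "Hillary Clinton", "Barack H. Obama"]
    print(nearest_variants("Barack Obama", names, k=2))
```

A fixed distance like this charges every edit the same cost, whereas the abstract describes learning the transducer's parameters from the collection itself, which is what the reported comparison against untrained distances measures.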