Text Classification from Labeled and Unlabeled Documents using EM
Machine Learning - Special issue on information retrieval
An improved error model for noisy channel spelling correction
ACL '00 Proceedings of the 38th Annual Meeting on Association for Computational Linguistics
A joint source-channel model for machine transliteration
ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Learning a spelling error model from search query logs
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Japanese query alteration based on semantic similarity
NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Report of NEWS 2009 machine transliteration shared task
NEWS '09 Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration
Whitepaper of NEWS 2009 machine transliteration shared task
NEWS '09 Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration
Latent class transliteration based on source language origin
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Transliteration has usually been performed with spelling-based supervised models. However, a single model cannot handle a mixture of words with different origins, such as "get" in "piaget" and "target". Li et al. (2007) proposed a class transliteration method that explicitly models source language origins and switches between them to address this issue. In contrast to their model, which requires a training corpus explicitly tagged with language origins, Hagiwara and Sekine (2011) proposed the latent class transliteration model, which treats language origins as latent classes and trains the transliteration table via the EM algorithm. However, this model, which can be formulated as a unigram mixture, is prone to overfitting because it relies on maximum likelihood estimation. We propose a novel latent semantic transliteration model based on the Dirichlet mixture, in which a Dirichlet mixture prior is introduced to mitigate the overfitting problem. We show that the proposed method considerably outperforms the conventional transliteration models.
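The latent-class idea in the abstract can be illustrated with a small sketch: EM training of a mixture of multinomials over transliteration "units" (e.g. substring pairs), where each latent class plays the role of a language origin. As a stand-in for the paper's full Dirichlet-mixture prior, this sketch uses a simple symmetric Dirichlet pseudo-count (MAP estimation, i.e. additive smoothing) in the M-step to damp the overfitting of pure maximum likelihood; the function and parameter names are illustrative, not from the paper.

```python
import math
import random
from collections import defaultdict

def em_mixture(docs, K, alpha=0.1, iters=20):
    """EM for a K-class mixture of multinomials over unit tokens.

    docs  : list of token lists (e.g. source/target substring pairs)
    K     : number of latent classes (stand-in for language origins)
    alpha : symmetric Dirichlet pseudo-count added in the M-step (MAP);
            alpha = 0 recovers plain maximum likelihood
    """
    vocab = sorted({u for d in docs for u in d})
    V = len(vocab)
    random.seed(0)
    # Random positive initialization of per-class unit distributions
    pi = [1.0 / K] * K
    theta = []
    for _ in range(K):
        t = {u: random.random() + 1.0 for u in vocab}
        z = sum(t.values())
        theta.append({u: v / z for u, v in t.items()})
    for _ in range(iters):
        cnt_pi = [0.0] * K
        cnt_theta = [defaultdict(float) for _ in range(K)]
        # E-step: responsibility of class k for doc d
        for d in docs:
            logp = [math.log(pi[k]) + sum(math.log(theta[k][u]) for u in d)
                    for k in range(K)]
            m = max(logp)
            r = [math.exp(lp - m) for lp in logp]
            z = sum(r)
            r = [x / z for x in r]
            for k in range(K):
                cnt_pi[k] += r[k]
                for u in d:
                    cnt_theta[k][u] += r[k]
        # M-step with Dirichlet(alpha) pseudo-counts to mitigate overfitting
        n = len(docs)
        pi = [(cnt_pi[k] + alpha) / (n + K * alpha) for k in range(K)]
        for k in range(K):
            z = sum(cnt_theta[k].values()) + V * alpha
            theta[k] = {u: (cnt_theta[k][u] + alpha) / z for u in vocab}
    return pi, theta
```

On data where two groups of words use disjoint units, the two learned classes separate them, and the `alpha` pseudo-counts keep unseen units from receiving zero probability, which is the kind of smoothing effect the Dirichlet prior provides.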