Lexical triggers and latent semantic analysis for cross-lingual language model adaptation

Authors:
Woosung Kim; Sanjeev Khudanpur
Affiliations:
The Johns Hopkins University, Baltimore, MD (both authors)
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2004

Abstract

In-domain texts for estimating statistical language models are not easily found for most languages of the world. We present two techniques to take advantage of in-domain text resources in other languages. First, we extend the notion of lexical triggers, which have been used monolingually for language model adaptation, to the cross-lingual problem, permitting the construction of sharper language models for a target-language document by drawing statistics from related documents in a resource-rich language. Second, we show that cross-lingual latent semantic analysis is similarly capable of extracting useful statistics for language modeling. Neither technique requires explicit translation capabilities between the two languages. We demonstrate significant reductions in both perplexity and word error rate on a Mandarin speech recognition task by using these techniques.
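
To make the idea of a cross-lingual lexical trigger concrete, the sketch below scores word pairs across a small set of document-aligned texts. It is an illustration only: the abstract does not state the selection criterion, so the use of document-level mutual information here is an assumption, and the function name `trigger_scores`, the input layout, and the romanized toy tokens are all hypothetical.

```python
import math
from collections import Counter


def trigger_scores(doc_pairs):
    """Score (source_word, target_word) pairs by mutual information.

    doc_pairs: list of (source_words, target_words), one entry per pair of
    related documents. Counts are document-level presence, not token counts.
    Assumption: mutual information is used as the trigger criterion; the
    abstract itself does not specify one.
    """
    n = len(doc_pairs)
    src_df, tgt_df, joint_df = Counter(), Counter(), Counter()
    for src, tgt in doc_pairs:
        s, t = set(src), set(tgt)
        src_df.update(s)
        tgt_df.update(t)
        joint_df.update((e, c) for e in s for c in t)

    scores = {}
    for (e, c), n11 in joint_df.items():
        n_e, n_c = src_df[e], tgt_df[c]
        # Cells of the 2x2 presence/absence table: (joint, row total, column total).
        cells = [
            (n11, n_e, n_c),
            (n_e - n11, n_e, n - n_c),
            (n_c - n11, n - n_e, n_c),
            (n - n_e - n_c + n11, n - n_e, n - n_c),
        ]
        mi = 0.0
        for joint, row, col in cells:
            if joint > 0:  # empty cells contribute nothing to the sum
                mi += (joint / n) * math.log(joint * n / (row * col))
        scores[(e, c)] = mi
    return scores


# Toy document-aligned data (hypothetical, romanized for readability).
pairs = [
    (["bank", "loan"], ["yinhang", "daikuan"]),
    (["bank"], ["yinhang"]),
    (["rain"], ["yu"]),
    (["rain", "storm"], ["yu"]),
]
ranked = sorted(trigger_scores(pairs).items(), key=lambda kv: -kv[1])
for (e, c), mi in ranked[:3]:
    print(f"{e} -> {c}: {mi:.3f}")
```

Only pairs that co-occur in at least one document pair are scored. No dictionary or translation model is involved, which matches the abstract's claim that explicit translation capabilities are not required; how such pairs would then be folded into a language model is outside the scope of this sketch.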