Incorporating monolingual corpora into bilingual latent semantic analysis for crosslingual LM adaptation

  • Authors:
  • Yik-Cheung Tam; Tanja Schultz

  • Affiliations:
  • InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

  • Venue:
  • ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
  • Year:
  • 2009

Abstract

The major limitation of bilingual latent semantic analysis (bLSA) is its requirement for parallel training corpora. Motivated by semi-supervised learning, we propose a cluster-based bLSA training approach that incorporates monolingual corpora. Each parallel document pair is treated as the centroid of a parallel document cluster, and each monolingual document is assigned to the closest centroid according to topic similarity. The resulting parallel document clusters are used as constraints to enforce a one-to-one topic correspondence in variational EM. A slight performance improvement in crosslingual language model adaptation is observed compared to the baseline trained without monolingual corpora.
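
The cluster-assignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the document representation, so the code below assumes that every document is summarized by a topic-proportion vector and that topic similarity is cosine similarity. The names `cosine` and `assign_to_centroids` and the toy vectors are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_to_centroids(centroids, mono_docs):
    """Attach each monolingual document to its most similar centroid.

    centroids: list of topic vectors, one per parallel document pair.
    mono_docs: dict mapping a document id to its topic vector.
    Returns a dict mapping centroid index to the list of attached document ids.
    """
    clusters = {i: [] for i in range(len(centroids))}
    for doc_id, topics in mono_docs.items():
        # Pick the centroid with the highest similarity to this document.
        best = max(range(len(centroids)),
                   key=lambda i: cosine(centroids[i], topics))
        clusters[best].append(doc_id)
    return clusters


# Toy data (hypothetical): two parallel pairs, three monolingual documents.
centroids = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
mono_docs = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.2, 0.2, 0.6],
    "doc_c": [0.9, 0.05, 0.05],
}
clusters = assign_to_centroids(centroids, mono_docs)
print(clusters)
```

In this sketch the resulting clusters only group documents; how the clusters then constrain the variational EM updates to keep topics aligned across languages is specific to the paper and is not reproduced here.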