Pseudo-aligned multilingual corpora

Authors:
Fernando Diaz;Donald Metzler
Affiliations:
Department of Computer Science, University of Massachusetts, Amherst, MA;Department of Computer Science, University of Massachusetts, Amherst, MA
Venue:
IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Year:
2007

Abstract

In machine translation, document alignment refers to finding correspondences between documents that are exact translations of each other. We define pseudo-alignment as the task of finding topical, as opposed to exact, correspondences between documents in different languages. We apply semi-supervised methods to pseudo-align multilingual corpora. Specifically, we construct a topic-based graph for each language. Then, given exact correspondences between a subset of documents, we project the unaligned documents into a shared lower-dimensional space. We demonstrate that documents that are close in this lower-dimensional space tend to share the same topic. This has applications in machine translation and cross-lingual information analysis. Experimental results show that pseudo-alignment of multilingual corpora is feasible and that the document alignments produced are qualitatively sound. Our technique requires no linguistic knowledge of the corpus. On average, when 10% of the corpus consists of exact correspondences, an on-topic correspondence occurs within the top 5 foreign neighbors in the lower-dimensional space, while the exact correspondence occurs within the top 10 foreign neighbors in this space. We also show how to substantially improve these results with a novel method for incorporating language-independent information.
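
The pipeline summarized in the abstract (one graph per language, a few known exact correspondences, a shared low-dimensional embedding, and a neighbor-rank evaluation) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the projection, so the sketch swaps in a joint graph Laplacian eigenmap in which known translation pairs are tied together by heavy cross-language edges. The function names (`knn_graph`, `pseudo_align`, `foreign_ranks`), the cosine k-nearest-neighbor graph, and the parameters `k`, `dim`, and `mu` are all hypothetical choices made for illustration.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor adjacency built from cosine similarity.
    Assumption: a stand-in for the 'topic-based graph' of one language."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)              # a document is not its own neighbor
    idx = np.argsort(-S, axis=1)[:, :k]       # k most similar documents per row
    W = np.zeros(S.shape)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)                 # symmetrize

def pseudo_align(Xa, Xb, pairs, k=5, dim=2, mu=10.0):
    """Embed both corpora in one space (assumed technique: joint Laplacian eigenmap).
    pairs: list of (i, j) where document i of corpus A is an exact translation
    of document j of corpus B. mu: weight of those cross-language edges."""
    na, nb = len(Xa), len(Xb)
    W = np.zeros((na + nb, na + nb))
    W[:na, :na] = knn_graph(Xa, k)
    W[na:, na:] = knn_graph(Xb, k)
    for i, j in pairs:
        W[i, na + j] = W[na + j, i] = mu      # tie known translations together
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    # Skip the first (constant) eigenvector; assumes the joint graph is connected.
    Y = vecs[:, 1:dim + 1]
    return Y[:na], Y[na:]

def foreign_ranks(Ya, Yb):
    """Rank (1 = nearest) of the true counterpart among foreign neighbors.
    Assumption: document i in A corresponds to document i in B."""
    D = np.linalg.norm(Ya[:, None, :] - Yb[None, :, :], axis=2)
    order = np.argsort(D, axis=1)
    return np.argmax(order == np.arange(len(Ya))[:, None], axis=1) + 1

# Synthetic usage: two "languages" are different linear views of shared topics.
rng = np.random.default_rng(0)
Z = rng.dirichlet(np.ones(3), size=30)        # latent topic mixtures
Xa = Z @ rng.random((3, 20))                  # corpus A features
Xb = Z @ rng.random((3, 25))                  # corpus B features
pairs = [(i, i) for i in range(3)]            # 10% exact correspondences
Ya, Yb = pseudo_align(Xa, Xb, pairs)
print("mean rank of exact counterpart:", foreign_ranks(Ya, Yb).mean())
```

The mean rank printed at the end mirrors the kind of "within the top N foreign neighbors" measurement reported in the abstract, but on synthetic data only; it says nothing about the numbers in the paper.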