In machine translation, document alignment refers to finding correspondences between documents that are exact translations of each other. We define pseudo-alignment as the task of finding topical, as opposed to exact, correspondences between documents in different languages. We apply semisupervised methods to pseudo-align multilingual corpora. Specifically, we construct a topic-based graph for each language. Then, given exact correspondences between a subset of documents, we project the unaligned documents into a shared lower-dimensional space. We demonstrate that documents that are close in this lower-dimensional space tend to share the same topic. This has applications in machine translation and cross-lingual information analysis. Experimental results show that pseudo-alignment of multilingual corpora is feasible and that the document alignments produced are qualitatively sound. Our technique requires no linguistic knowledge of the corpus. On average, when 10% of the corpus consists of exact correspondences, an on-topic correspondence occurs within the top 5 foreign neighbors in the lower-dimensional space, while the exact correspondence occurs within the top 10 foreign neighbors in this space. We also show how to substantially improve these results with a novel method for incorporating language-independent information.
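The pipeline described above (one topic-based graph per language, a subset of known exact correspondences, and a projection into a shared lower-dimensional space) can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a generic Laplacian-eigenmaps-style semi-supervised alignment in which known pairs are tied together by an extra edge of weight `mu`. The function names, the cosine affinity, the value of `mu`, and the synthetic topic vectors are all hypothetical choices made for illustration.

```python
import numpy as np

def cosine_affinity(T):
    """Dense graph over documents: cosine similarity of their topic vectors."""
    U = T / np.linalg.norm(T, axis=1, keepdims=True)
    W = U @ U.T
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def graph_laplacian(W):
    """Unnormalized graph Laplacian D - W."""
    return np.diag(W.sum(axis=1)) - W

def align(Tx, Ty, pairs, dim, mu=10.0):
    """Embed both corpora in one space (assumed formulation, not the paper's).

    Tx, Ty : per-language document-by-topic matrices (topic spaces may differ).
    pairs  : known exact correspondences as (index in Tx, index in Ty).
    mu     : weight of the edge tying each known pair together (assumed value).
    """
    nx, ny = len(Tx), len(Ty)
    L = np.zeros((nx + ny, nx + ny))
    L[:nx, :nx] = graph_laplacian(cosine_affinity(Tx))
    L[nx:, nx:] = graph_laplacian(cosine_affinity(Ty))
    for i, j in pairs:
        j = nx + j  # position of the foreign document in the joint graph
        L[i, i] += mu
        L[j, j] += mu
        L[i, j] -= mu
        L[j, i] -= mu
    # eigh returns eigenvalues in ascending order; column 0 is the constant
    # vector of the connected joint graph, so it is skipped.
    _, vecs = np.linalg.eigh(L)
    Z = vecs[:, 1:dim + 1]
    return Z[:nx], Z[nx:]

def foreign_neighbors(Zx, Zy, i, k=5):
    """Indices of the k foreign documents closest to document i of the first corpus."""
    dist = np.linalg.norm(Zy - Zx[i], axis=1)
    return np.argsort(dist)[:k]

# Synthetic toy corpus (an assumption for illustration, not data from the paper):
# 3 topics, 10 documents per topic and language, document i in one language
# is the translation of document i in the other.
rng = np.random.default_rng(0)
n_topics, per_topic = 3, 10
labels = np.repeat(np.arange(n_topics), per_topic)

def toy_topic_vectors(column_order):
    """Noisy one-hot topic vectors; columns are permuted to mimic a topic
    model trained independently for each language."""
    T = 0.01 * rng.random((labels.size, n_topics))
    T[np.arange(labels.size), labels] += 1.0
    return T[:, column_order]

Tx = toy_topic_vectors([0, 1, 2])
Ty = toy_topic_vectors([2, 0, 1])

# Only two documents per topic are given as exact correspondences.
anchors = [t * per_topic + r for t in range(n_topics) for r in range(2)]
pairs = [(i, i) for i in anchors]

Zx, Zy = align(Tx, Ty, pairs, dim=n_topics - 1)
query = 5  # an unaligned document
print("query topic:", labels[query])
print("topics of its 5 nearest foreign neighbors:",
      labels[foreign_neighbors(Zx, Zy, query)])
```

In this toy setting the only cross-language information is the list of anchor pairs, so the sketch uses no linguistic knowledge of either corpus. On real data the graphs would more likely be sparse (for example k-nearest-neighbor graphs), and `mu` and `dim` would need tuning.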