Automatic bilingual lexicon acquisition using random indexing of parallel corpora

Authors:
M. Sahlgren;J. Karlgren
Affiliations:
Swedish Institute of Computer Science (SICS), Box 1263, SE-164 29 Kista, Sweden; e-mail: mange@sics.se, jussi@sics.se
Venue:
Natural Language Engineering
Year:
2005

Abstract

This paper presents a simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition. The approach uses the Random Indexing vector space methodology and finds correlations between terms from their distributional characteristics. It requires minimal preprocessing and linguistic knowledge, and it is efficient, fast and scalable. We explain how our approach differs from traditional cooccurrence-based word alignment algorithms, and we demonstrate how to extract bilingual lexica by applying Random Indexing to aligned parallel data. The acquired lexica are evaluated against manually compiled gold standards, and we report an overlap of around 60%. We also discuss methodological problems with evaluating lexical resources of this kind.
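
The abstract gives no implementation details, so the following is only a minimal sketch of how Random Indexing could be applied to aligned parallel data, not the authors' implementation. It assumes that each aligned sentence pair receives one sparse random index vector, that every word in either language accumulates the index vectors of the pairs it occurs in, and that translation candidates are ranked by cosine similarity. The function names, the dimensionality, the number of non-zero elements and the toy English-Swedish corpus are all hypothetical choices made for illustration.

```python
from collections import defaultdict

import numpy as np


def index_vector(rng, dim, nonzeros):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=nonzeros, replace=False)
    vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
    return vec


def build_context_vectors(aligned_pairs, dim=1000, nonzeros=6, seed=0):
    """Accumulate one context vector per word in each language.

    Assumption: the aligned sentence pair is the context unit, so both
    sides of a pair share the same random index vector.
    """
    rng = np.random.default_rng(seed)
    source_vecs = defaultdict(lambda: np.zeros(dim))
    target_vecs = defaultdict(lambda: np.zeros(dim))
    for source_sentence, target_sentence in aligned_pairs:
        region = index_vector(rng, dim, nonzeros)
        # Whitespace tokenization and lowercasing are the only preprocessing.
        for word in set(source_sentence.lower().split()):
            source_vecs[word] += region
        for word in set(target_sentence.lower().split()):
            target_vecs[word] += region
    return dict(source_vecs), dict(target_vecs)


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0


def translate(word, source_vecs, target_vecs):
    """Return the target word whose context vector is closest to the source word's."""
    query = source_vecs[word]
    return max(target_vecs, key=lambda t: cosine(query, target_vecs[t]))


# Hypothetical toy corpus of aligned English-Swedish sentence pairs.
pairs = [
    ("the dog sleeps", "hunden sover"),
    ("the cat sleeps", "katten sover"),
    ("the dog runs", "hunden springer"),
    ("the cat runs", "katten springer"),
]
source_vecs, target_vecs = build_context_vectors(pairs)
print(translate("dog", source_vecs, target_vecs))
```

In this toy corpus "dog" and "hunden" occur in exactly the same aligned pairs, so their context vectors coincide and "hunden" is ranked first. On real data the vectors only correlate, which is why an extracted lexicon has to be checked against a gold standard.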