Processing comparable corpora with Bilingual Suffix Trees

  • Authors:
  • Dragos Stefan Munteanu; Daniel Marcu

  • Affiliations:
  • University of Southern California, Marina del Rey, CA; University of Southern California, Marina del Rey, CA

  • Venue:
  • EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
  • Year:
  • 2002

Abstract

We introduce Bilingual Suffix Trees (BST), a data structure suitable for exploiting comparable corpora. We discuss algorithms that use BSTs to create parallel corpora and to learn translations of unseen words from comparable corpora. Starting with a small bilingual dictionary derived automatically from a corpus of 5,000 parallel sentences, we automatically extracted a corpus of 33,926 parallel phrases of size greater than 3 and learned 9 new word translations from a comparable corpus of 1.3M words (100,000 sentences).
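
The abstract does not spell out how a BST is built or queried, so the following is only a rough illustration of the general idea, not the authors' algorithm. It assumes a plain word-level suffix trie over the source-language sentences, a dictionary that maps each target word to a tuple of candidate source words, and a greedy, monotone match (no reordering). The names `build_suffix_trie`, `longest_match`, `extract_phrase_pairs`, and `min_len` are hypothetical.

```python
class Node:
    """One trie node; children are keyed by source-language words."""

    def __init__(self):
        self.children = {}


def build_suffix_trie(sentences):
    """Insert every word-level suffix of every sentence into a trie.

    Assumption: a naive trie (quadratic in sentence length) stands in
    for a real, compact suffix tree.
    """
    root = Node()
    for sentence in sentences:
        words = sentence.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
    return root


def longest_match(root, target_words, start, dictionary):
    """Walk the trie from the root, following an edge only when its source
    word is a dictionary translation of the next target word.

    Assumption: the first matching candidate is taken greedily.
    """
    node, source_phrase = root, []
    for target_word in target_words[start:]:
        candidates = dictionary.get(target_word, ())
        step = next((s for s in candidates if s in node.children), None)
        if step is None:
            break
        source_phrase.append(step)
        node = node.children[step]
    return source_phrase


def extract_phrase_pairs(source_sentences, target_sentences, dictionary, min_len=4):
    """Collect (source phrase, target phrase) pairs of at least min_len words.

    min_len=4 mirrors the abstract's "size greater than 3".
    """
    root = build_suffix_trie(source_sentences)
    pairs = set()
    for sentence in target_sentences:
        words = sentence.split()
        for start in range(len(words)):
            match = longest_match(root, words, start, dictionary)
            if len(match) >= min_len:
                target_phrase = " ".join(words[start:start + len(match)])
                pairs.add((" ".join(match), target_phrase))
    return pairs
```

In this sketch a target word with no dictionary entry simply ends the current match; how the paper treats unseen words and learns their translations is not described in the abstract and is not modeled here.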