Creating and exploiting a comparable corpus in cross-language information retrieval

  • Authors:
  • Tuomas Talvensaari;Jorma Laurikkala;Kalervo Järvelin;Martti Juhola;Heikki Keskustalo

  • Affiliations:
  • University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2007

Abstract

We present a method for creating a comparable text corpus from two document collections in different languages. The collections can be very different in origin. In this study, we build a comparable corpus from articles by a Swedish news agency and a U.S. newspaper. The keys with the best resolution power were extracted from the documents of one collection, the source collection, by using the relative average term frequency (RATF) value. The keys were translated into the language of the other collection, the target collection, with a dictionary-based query translation program. The translated queries were run against the target collection, and an alignment pair was made if a retrieved document met the given date and similarity score criteria. The resulting comparable collection was used as a similarity thesaurus to translate queries, in combination with a dictionary-based translator. The combined approach outperformed translation schemes in which dictionary-based translation or corpus-based translation was used alone.
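The key selection and alignment steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not give the RATF formula, so the form used here, (cf / df) * 1000 / ln(df + SP)^p, follows the way RATF is usually stated, and the parameter values `sp` and `p`, the function names `ratf`, `top_keys`, and `align`, the `Hit` record, and the thresholds are all hypothetical. Dictionary-based translation and retrieval against the target collection are left out.

```python
import math
from dataclasses import dataclass
from datetime import date


def ratf(cf: int, df: int, sp: float = 1000.0, p: float = 3.0) -> float:
    """Assumed RATF form: (cf / df) * 1000 / ln(df + sp) ** p.

    cf is the collection frequency of a key and df its document frequency.
    The default sp and p are placeholders, not values from the paper.
    """
    return (cf / df) * 1000.0 / math.log(df + sp) ** p


def top_keys(stats: dict[str, tuple[int, int]], doc_terms: set[str], n: int) -> list[str]:
    """Return the n terms of one source document with the highest RATF value."""
    scored = [(ratf(*stats[t]), t) for t in doc_terms if t in stats]
    # Sort by descending RATF; ties are broken alphabetically for determinism.
    return [t for _, t in sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]]


@dataclass
class Hit:
    """One target-collection document retrieved by a translated query (hypothetical)."""
    doc_id: str
    published: date
    score: float


def align(source_date: date, hits: list[Hit], max_days: int, min_score: float) -> list[str]:
    """Keep the target documents that satisfy both the date and the score criterion."""
    return [
        h.doc_id
        for h in hits
        if abs((h.published - source_date).days) <= max_days and h.score >= min_score
    ]
```

In this sketch a source document would first be reduced to its top keys, the keys would be translated and run as a query, and `align` would then filter the retrieved documents. A real system would tune the date window and score threshold on the collections at hand.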