How comparable are parallel corpora? Measuring the distribution of general vocabulary and connectives

  • Authors:
  • Bruno Cartoni;Sandrine Zufferey;Thomas Meyer;Andrei Popescu-Belis

  • Affiliations:
  • University of Geneva, rue de Candolle, Geneva;University of Geneva, rue de Candolle, Geneva;Idiap Research Institute, Rue Marconi, Martigny;Idiap Research Institute, Rue Marconi, Martigny

  • Venue:
  • BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
  • Year:
  • 2011

Abstract

In this paper, we question the homogeneity of a large parallel corpus by measuring the similarity between its sub-parts. We compare the results of two measures: a general measure of lexical similarity based on χ², and the frequency of discourse connectives. We argue that discourse connectives provide a more sensitive measure, revealing differences that are not visible with the general measure. We also provide evidence for the existence of specific characteristics that distinguish translated texts from non-translated ones, due to a universal tendency for explicitation.
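
The two kinds of measure contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the χ² measure is computed, so the sketch assumes a common formulation in which the statistic is summed over the most frequent words of two samples. The function names, the `top_n` cutoff, the per-1000-token normalization, and the restriction to single-word connectives are all assumptions made for illustration.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both samples.

    Assumed formulation (not taken from the paper): lower values mean the
    two samples have more similar word-frequency distributions.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    joint = counts_a + counts_b
    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both samples came from the same distribution.
        expected_a = joint_count * total_a / (total_a + total_b)
        expected_b = joint_count * total_b / (total_a + total_b)
        chi2 += (counts_a[word] - expected_a) ** 2 / expected_a
        chi2 += (counts_b[word] - expected_b) ** 2 / expected_b
    return chi2


def connective_rate(tokens, connectives, per=1000):
    """Occurrences of the given single-word connectives per `per` tokens.

    Hypothetical helper: multi-word connectives and sense disambiguation
    are deliberately left out of this sketch.
    """
    counts = Counter(tokens)
    hits = sum(counts[c] for c in connectives)
    return per * hits / len(tokens)
```

Under these assumptions, two sub-parts of a corpus could yield a low χ² value, and so look homogeneous, while still differing noticeably in their connective rates, which is the kind of contrast the abstract describes.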