Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus

  • Authors:
  • L. Bentivogli;E. Pianta

  • Affiliations:
  • ITC-irst, Via Sommarive, 18-38050 Povo, Trento, Italy e-mail: bentivo@itc.it;ITC-irst, Via Sommarive, 18-38050 Povo, Trento, Italy e-mail: bentivo@itc.it

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2005

Abstract

In this article we illustrate and evaluate an approach to creating high-quality linguistically annotated resources by exploiting aligned parallel corpora. The approach rests on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target text using word alignment as a bridge. The transfer approach has been tested and extensively applied in the creation of the MultiSemCor corpus, an English/Italian parallel corpus built on the English SemCor corpus. In MultiSemCor the texts are aligned at the word level and annotated with word senses drawn from a shared sense inventory. A number of experiments have been carried out to evaluate the different steps of the methodology, and the results suggest that the transfer approach is one promising solution to the resource bottleneck. First, it leads to the creation of a parallel corpus, which is a crucial resource per se. Second, it allows existing (mostly English) annotated resources to be exploited to bootstrap the creation of annotated corpora in new (resource-poor) languages with greatly reduced human effort.
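The transfer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `transfer_annotations`, the sense labels, the toy sentence pair, and the rule that the first alignment link reaching a target token wins are all assumptions made for illustration. It only shows how sense annotations on source tokens might be projected onto target tokens through word-alignment links.

```python
from typing import List, Optional, Tuple


def transfer_annotations(
    source_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Project sense labels from source tokens onto aligned target tokens.

    source_senses[i] is the sense label of source token i, or None if the
    token is unannotated. alignment holds (source_index, target_index) links.
    """
    target_senses: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in alignment:
        sense = source_senses[src_idx]
        # Assumption: only annotated source tokens contribute, and the first
        # link that reaches a target token decides its label.
        if sense is not None and target_senses[tgt_idx] is None:
            target_senses[tgt_idx] = sense
    return target_senses


# Toy English/Italian pair; sense labels are hypothetical placeholders,
# not actual entries from the shared sense inventory.
english = ["the", "dog", "sleeps"]
italian = ["il", "cane", "dorme"]
english_senses = [None, "dog#n#1", "sleep#v#1"]
links = [(0, 0), (1, 1), (2, 2)]

italian_senses = transfer_annotations(english_senses, links, len(italian))
print(list(zip(italian, italian_senses)))
```

Target tokens with no alignment link, or linked only to unannotated source tokens, stay unlabeled in this sketch.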