AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan

  • Authors:
  • Marta Recasens; M. Antònia Martí

  • Affiliations:
  • Centre de Llenguatge i Computació (CLiC), University of Barcelona, Barcelona, Spain 08007; Centre de Llenguatge i Computació (CLiC), University of Barcelona, Barcelona, Spain 08007

  • Venue:
  • Language Resources and Evaluation
  • Year:
  • 2010

Abstract

This article describes the enrichment of the AnCora corpora of Spanish and Catalan (400k each) with coreference links between pronouns (including elliptical subjects and clitics), full noun phrases (including proper nouns), and discourse segments. The coding scheme distinguishes between identity links, predicative relations, and discourse deixis. Inter-annotator agreement on the link types is 85–89% above chance, and we provide an analysis of the sources of disagreement. The resulting corpora make it possible to train and test learning-based algorithms for automatic coreference resolution, as well as to carry out bottom-up linguistic descriptions of coreference relations as they occur in real data.