An account of the challenge of tagging a reference corpus for Brazilian Portuguese

  • Authors:
  • Sandra Aluísio; Jorge Pelizzoni; Ana Raquel Marchi; Lucélia de Oliveira; Regiana Manenti; Vanessa Marquiafável

  • Affiliations:
  • ICMC - DCCE, University of São Paulo, São Carlos, SP, Brazil and Núcleo Interinstitucional de Lingüística Computacional, ICMC-USP, São Carlos, SP, Brazil;Núcleo Interinstitucional de Lingüística Computacional, ICMC-USP, São Carlos, SP, Brazil;Núcleo Interinstitucional de Lingüística Computacional, ICMC-USP, São Carlos, SP, Brazil;Núcleo Interinstitucional de Lingüística Computacional, ICMC-USP, São Carlos, SP, Brazil;Núcleo Interinstitucional de Lingüística Computacional, ICMC-USP, São Carlos, SP, Brazil;Núcleo Interinstitucional de Lingüística Computacional, ICMC-USP, São Carlos, SP, Brazil

  • Venue:
  • PROPOR'03 Proceedings of the 6th international conference on Computational processing of the Portuguese language
  • Year:
  • 2003

Abstract

This article identifies and addresses the major linguistic and conceptual issues, as opposed to logistical ones, faced in the morphosyntactic tagging of MAC-Morpho, a 1.1-million-word Brazilian Portuguese corpus of newspaper articles developed in the Lacio-Web Project. Rather than simply presenting the annotated corpus and describing its tagset, we elaborate on the criteria for establishing the tagset and analyze some interesting cases among the linguistic problems we faced in this work.