A large Portuguese corpus on-line: cleaning and preprocessing

  • Authors:
  • Michel Généreux; Iris Hendrickx; Amália Mendes

  • Affiliations:
  • Centro de Linguística da Universidade de Lisboa, Lisboa, Portugal; Centro de Linguística da Universidade de Lisboa, Lisboa, Portugal; Centro de Linguística da Universidade de Lisboa, Lisboa, Portugal

  • Venue:
  • PROPOR'12 Proceedings of the 10th international conference on Computational Processing of the Portuguese Language
  • Year:
  • 2012

Abstract

We present a newly available on-line resource for Portuguese: a corpus of 310 million words that constitutes a new version of the Reference Corpus of Contemporary Portuguese and is now searchable via a user-friendly web interface. Here we report on the work carried out on the corpus prior to its on-line publication. We focus on the processes and tools used for cleaning, preparing and annotating the corpus to make it suitable for linguistic inquiry.