A machine learning approach to Portuguese clause identification

  • Authors:
  • Eraldo R. Fernandes;Cícero N. dos Santos;Ruy L. Milidiú

  • Affiliations:
  • Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, PUC-Rio, Rio de Janeiro, Brazil;Mestrado em Informática Aplicada – MIA, Universidade de Fortaleza – UNIFOR, Fortaleza, Brazil;Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, PUC-Rio, Rio de Janeiro, Brazil

  • Venue:
  • PROPOR'10 Proceedings of the 9th international conference on Computational Processing of the Portuguese Language
  • Year:
  • 2010

Abstract

In this work, we apply a machine-learning-based system to Portuguese clause identification and evaluate it. To the best of our knowledge, this is the first machine-learning-based approach to this task. The proposed system is based on Entropy Guided Transformation Learning. To train and evaluate the system, we derive a clause-annotated corpus from the Bosque corpus of the Floresta Sintá(c)tica Project, a European and Brazilian Portuguese treebank. We add part-of-speech (POS) tags to the derived corpus using an automatic state-of-the-art tagger. Additionally, we use a simple heuristic to derive a phrase-chunk-like (PCL) feature from phrases in the Bosque corpus. We train an extractor for this sub-task and use it to automatically add the PCL feature to the derived clause corpus. The proposed clause identifier uses POS and PCL tags as input features. The system achieves an Fβ=1 of 73.90 when using the golden values of the PCL feature. When the automatically extracted values are used, it obtains an Fβ=1 of 69.31. These are promising results for a first machine learning approach to Portuguese clause identification. Moreover, they are achieved using a very simple PCL feature, generated by a PCL extractor developed with very little modeling effort.
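The Fβ=1 figures reported above combine precision and recall into a single score. The sketch below shows one way such a score could be computed for clause identification. It assumes, since the abstract does not say so, that a predicted clause counts as correct only when its (start, end) token span exactly matches a gold clause. The function names and the example spans are hypothetical and are not taken from the paper or its corpus.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start_token, end_token), inclusive; hypothetical representation


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_clauses(gold: Iterable[Span], predicted: Iterable[Span], beta: float = 1.0):
    """Return (precision, recall, F_beta) under exact span matching (an assumption)."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall, f_beta(precision, recall, beta)


# Made-up example: one sentence with three gold clauses, one of them nested.
gold_spans = [(0, 9), (2, 5), (6, 9)]
predicted_spans = [(0, 9), (2, 4), (6, 9)]  # the nested clause ends one token early

precision, recall, f1 = score_clauses(gold_spans, predicted_spans)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under exact matching, a clause whose boundary is off by a single token counts as both a false positive and a false negative, which is why boundary-level features such as POS and PCL tags matter for this task.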