A Machine Learning Approach to POS Tagging

  • Authors:
  • Lluís Màrquez;Lluís Padró;Horacio Rodríguez

  • Affiliations:
  • Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, c/ Jordi Girona 1–3. Barcelona 08034, Catalonia. lluism@lsi.upc.es;Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, c/ Jordi Girona 1–3. Barcelona 08034, Catalonia. padro@lsi.upc.es;Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, c/ Jordi Girona 1–3. Barcelona 08034, Catalonia. horacio@lsi.upc.es

  • Venue:
  • Machine Learning
  • Year:
  • 2000

Abstract

We have applied the inductive learning of statistical decision trees and relaxation labeling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part Of Speech Tagging). The learning process is supervised and obtains a language model oriented to resolve POS ambiguities, consisting of a set of statistical decision trees expressing distribution of tags and words in some relevant contexts. The acquired decision trees have been directly used in a tagger that is both relatively simple and fast, and which has been tested and evaluated on the Wall Street Journal (WSJ) corpus with competitive accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation labeling based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only limited training material is available, which is crucial in any process of constructing, from scratch, an annotated corpus. We show that high levels of accuracy can be achieved with our system in this situation, and report some results obtained when using it to develop a 5.5 million words Spanish corpus from scratch.
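
To make the relaxation labeling idea mentioned in the abstract more concrete, the following is a minimal sketch, not the tagger described in the paper. It assumes the only constraints are compatibility values between adjacent tag pairs (a stand-in for n-gram constraints), and it uses a commonly cited update rule in which each tag weight is multiplied by one plus its support and then renormalized. The function name `relax`, the example sentence, and all compatibility values are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple

TagWeights = Dict[str, float]


def relax(
    candidates: List[List[str]],
    compat: Dict[Tuple[str, str], float],
    iterations: int = 10,
) -> List[TagWeights]:
    """Iteratively reweight the candidate tags of each word.

    candidates: possible tags for each word, in sentence order.
    compat: hypothetical compatibility of (left_tag, right_tag) pairs;
            positive values support the pair, negative values penalize it.
    """
    # Start from a uniform distribution over each word's candidate tags.
    weights = [{t: 1.0 / len(tags) for t in tags} for tags in candidates]

    for _ in range(iterations):
        updated: List[TagWeights] = []
        for i, current in enumerate(weights):
            unnormalized: TagWeights = {}
            for tag, weight in current.items():
                support = 0.0
                if i > 0:  # evidence from the word on the left
                    support += sum(
                        compat.get((left, tag), 0.0) * w
                        for left, w in weights[i - 1].items()
                    )
                if i + 1 < len(weights):  # evidence from the word on the right
                    support += sum(
                        compat.get((tag, right), 0.0) * w
                        for right, w in weights[i + 1].items()
                    )
                # Clamp so that (1 + support) never becomes negative.
                support = max(-1.0, min(1.0, support))
                unnormalized[tag] = weight * (1.0 + support)
            total = sum(unnormalized.values())
            if total > 0:
                updated.append({t: v / total for t, v in unnormalized.items()})
            else:
                # Every tag was fully penalized; keep the previous weights.
                updated.append(dict(current))
        weights = updated
    return weights


# Illustrative input: "the can rusts", where "can" is ambiguous (noun vs. modal).
candidates = [["DT"], ["NN", "MD"], ["VBZ"]]
compat = {
    ("DT", "NN"): 0.5,
    ("DT", "MD"): -0.5,
    ("NN", "VBZ"): 0.3,
    ("MD", "VBZ"): -0.3,
}

weights = relax(candidates, compat, iterations=5)
best_tags = [max(w, key=w.get) for w in weights]
print(best_tags)
```

In this sketch all words are updated in parallel from the previous iteration's weights, so the result does not depend on the order in which words are visited. Richer constraints, such as rules derived from decision trees, would contribute additional terms to `support` in the same way.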