Multiple model text normalization for the Polish language

  • Authors:
  • Łukasz Brocki; Krzysztof Marasek; Danijel Koržinek

  • Affiliations:
  • Polish-Japanese Institute of Information Technology, Warsaw, Poland (all authors)

  • Venue:
  • ISMIS'12 Proceedings of the 20th international conference on Foundations of Intelligent Systems
  • Year:
  • 2012

Abstract

This paper describes a text normalization program for the Polish language. The program combines rule-based and statistical approaches to text normalization. The words modelled by the solution are represented in three domains: grammar features, word lemmas, and the words themselves. Each word in the lexicon is assigned a suitable element from each of these domains. Three n-gram models, operating on grammar classes, word lemmas, and individual words respectively, are then combined using weights adjusted by an evolution strategy to obtain the final solution. The tool can also produce grammar tags for words to aid in further language model creation.
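The abstract does not say how the three models are combined or which evolution strategy is used. As a rough illustration only, the sketch below assumes a linear interpolation of the three models' probabilities and a simple (1+1) evolution strategy that maximizes held-out log-likelihood. The function names, the toy probabilities, and the parameters `generations` and `sigma` are hypothetical and are not taken from the paper.

```python
import math
import random


def interpolate(probs, weights):
    """Weighted sum of the per-model probabilities for one token."""
    return sum(w * p for w, p in zip(weights, probs))


def normalize(weights):
    """Keep weights strictly positive and rescale them to sum to 1."""
    clipped = [max(w, 1e-9) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]


def log_likelihood(data, weights):
    """data: (grammar_p, lemma_p, word_p) triples for the correct tokens."""
    return sum(math.log(interpolate(p, weights)) for p in data)


def evolve_weights(data, generations=200, sigma=0.1, seed=0):
    """(1+1) evolution strategy: mutate the parent with Gaussian noise
    and keep the child only if its score is not worse (assumed scheme)."""
    rng = random.Random(seed)
    parent = [1 / 3, 1 / 3, 1 / 3]
    best = log_likelihood(data, parent)
    for _ in range(generations):
        child = normalize([w + rng.gauss(0, sigma) for w in parent])
        score = log_likelihood(data, child)
        if score >= best:
            parent, best = child, score
    return parent, best


# Toy held-out probabilities (made up): grammar-class, lemma, and word models.
held_out = [(0.20, 0.05, 0.01), (0.30, 0.10, 0.02), (0.25, 0.02, 0.001)]
weights, score = evolve_weights(held_out)
print(weights, score)
```

In this sketch the weights always stay positive and sum to 1, so the interpolated value remains a valid probability, and the accepted score never falls below that of the uniform starting weights.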