Multiple model text normalization for the Polish language

Authors:
Łukasz Brocki;Krzysztof Marasek;Danijel Koržinek
Affiliations:
Polish-Japanese Institute of Information Technology, Warsaw, Poland;Polish-Japanese Institute of Information Technology, Warsaw, Poland;Polish-Japanese Institute of Information Technology, Warsaw, Poland
Venue:
ISMIS'12 Proceedings of the 20th international conference on Foundations of Intelligent Systems
Year:
2012

Abstract

This paper describes a text normalization program for the Polish language. The program combines rule-based and statistical approaches to text normalization. The words modelled by this solution were described in three domains: grammar features, word lemmas, and the words themselves. Each word in the lexicon was assigned a suitable element from each of these domains. Three n-gram models, operating on grammar classes, word lemmas, and individual words respectively, were then combined using weights adjusted by an evolution strategy to obtain the final solution. The tool can also produce grammar tags for words to aid in further language model creation.
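
The weighted combination of three models can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the models are combined or which evolution strategy is used, so the sketch assumes linear interpolation of the three probabilities and a simple (1+1) evolution strategy that minimizes negative log-likelihood on held-out data. All function names (`combine`, `normalize`, `neg_log_likelihood`, `evolve_weights`) and the toy probabilities are hypothetical.

```python
import math
import random


def combine(p_word, p_lemma, p_class, weights):
    """Linearly interpolate three model probabilities (assumed combination rule)."""
    w_word, w_lemma, w_class = weights
    return w_word * p_word + w_lemma * p_lemma + w_class * p_class


def normalize(weights):
    """Clip weights to stay positive and rescale them so they sum to 1."""
    clipped = [max(w, 1e-6) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]


def neg_log_likelihood(weights, samples):
    """Average negative log-probability of the samples under the combined model."""
    return -sum(math.log(combine(*s, weights)) for s in samples) / len(samples)


def evolve_weights(samples, generations=200, sigma=0.1, seed=0):
    """(1+1) evolution strategy: mutate the parent, keep the child if it is no worse."""
    rng = random.Random(seed)
    parent = normalize([1.0, 1.0, 1.0])  # start from equal weights
    parent_cost = neg_log_likelihood(parent, samples)
    for _ in range(generations):
        # Gaussian mutation of every weight, then projection back to valid weights.
        child = normalize([w + rng.gauss(0.0, sigma) for w in parent])
        child_cost = neg_log_likelihood(child, samples)
        if child_cost <= parent_cost:
            parent, parent_cost = child, child_cost
    return parent, parent_cost


# Toy held-out data: (p_word, p_lemma, p_class) assigned to each observed token
# by the three n-gram models. The numbers are invented for illustration only.
HELD_OUT = [
    (0.20, 0.10, 0.05),
    (0.01, 0.15, 0.08),
    (0.30, 0.25, 0.10),
    (0.02, 0.03, 0.12),
]

weights, cost = evolve_weights(HELD_OUT)
print("weights (word, lemma, class):", weights)
print("held-out cost:", cost)
```

In this sketch the cost never increases from one generation to the next, because a mutated weight vector replaces its parent only when it scores at least as well on the held-out samples.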