Letter level learning for language independent diacritics restoration

Authors:
Rada Mihalcea; Vivi Nastase
Affiliations:
University of North Texas, Denton, TX; University of Ottawa, Ottawa, ON
Venue:
COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
Year:
2002

Citing 5

C4.5: programs for machine learning

Forgetting Exceptions is Harmful in Language Learning
Machine Learning - Special issue on natural language learning

Information Retrieval

Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French
ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics

Scaling to very very large corpora for natural language disambiguation
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics

Cited 7

Character N-Gram Tokenization for European Language Text Retrieval
Information Retrieval

Constrained Sequence Classification for Lexical Disambiguation
PRICAI '08 Proceedings of the 10th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence

Analysis of Automatic Stress Assignment in Slovene
Informatica

Automatic diacritic restoration for resource-scarce languages
TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue

Special speech synthesis for social network websites
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue

Statistical unicodification of African languages
Language Resources and Evaluation

Exploring new languages with HAIRCUT at CLEF 2005
CLEF'05 Proceedings of the 6th international conference on Cross-Language Evaluation Forum: accessing Multilingual Information Repositories

Quantified Score

Hi-index	0.00

Abstract

This paper presents a method for diacritics restoration based on learning mechanisms that act at the letter level. The method requires no tagging tools or resources other than raw text, which makes it language independent and particularly appealing for languages with few available resources. The algorithm was evaluated on four languages (Czech, Hungarian, Polish and Romanian) and achieved an average accuracy of over 98%.
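
To make the letter-level idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not name the learner or the features, so this sketch assumes a simple majority vote over the surrounding letters, with back-off to narrower windows. The names `strip_text`, `train`, `restore`, the padding symbol `PAD` and the window size `k` are all hypothetical. The only resource used is raw text that already carries diacritics.

```python
import unicodedata
from collections import Counter, defaultdict

PAD = "_"  # hypothetical padding symbol for positions beyond the text edges


def strip_char(ch: str) -> str:
    """Return the base letter of one character, e.g. 'ă' -> 'a'.

    Relies on Unicode decomposition, so letters without one
    (such as Polish 'ł') are returned unchanged.
    """
    return unicodedata.normalize("NFD", ch)[0]


def strip_text(text: str) -> str:
    """Strip diacritics letter by letter from the NFC form of the text."""
    return "".join(strip_char(c) for c in unicodedata.normalize("NFC", text))


def window(padded: str, i: int, k: int, w: int) -> str:
    """Letters within distance w of position i (text padded by k per side)."""
    center = i + k
    return padded[center - w:center + w + 1]


def train(text: str, k: int = 3) -> dict:
    """Count which original letter appears in each stripped letter context."""
    text = unicodedata.normalize("NFC", text)
    padded = PAD * k + strip_text(text) + PAD * k
    model = defaultdict(Counter)
    for i, target in enumerate(text):
        for w in range(k + 1):  # store every window width for back-off
            model[(w, window(padded, i, k, w))][target] += 1
    return model


def restore(model: dict, plain: str, k: int = 3) -> str:
    """Pick, per letter, the majority label of the widest context seen."""
    padded = PAD * k + plain + PAD * k
    out = []
    for i, ch in enumerate(plain):
        best = ch  # unseen letters are left as they are
        for w in range(k, -1, -1):
            votes = model.get((w, window(padded, i, k, w)))
            if votes:
                best = votes.most_common(1)[0][0]
                break
        out.append(best)
    return "".join(out)


corpus = "mulți oameni sunt în țară și în oraș"  # toy Romanian training text
model = train(corpus)
print(restore(model, strip_text(corpus)))
```

On such a tiny corpus the sketch only recovers contexts it has already seen; the accuracy reported in the abstract was obtained by the authors' own learner on real corpora, not by this code.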