Modelling highly inflected languages

Authors:
Mirjam Sepesy Maučec;Zdravko Kačič;Bogomir Horvat
Affiliations:
Faculty for Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, Slovenia;Faculty for Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, Slovenia;Faculty for Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, Slovenia
Venue:
Information Sciences—Informatics and Computer Science: An International Journal
Year:
2004

Abstract

Statistical language models encapsulate the varied information, both grammatical and semantic, that is present in a language. This paper investigates techniques for overcoming the difficulties of modelling highly inflected languages, where the main problem is the large number of distinct word forms. We propose modelling the grammatical and semantic information of words separately by splitting words into stems and endings. All information is handled within a data-driven formalism. Grammatical information is modelled well by short-term dependencies. This article is primarily concerned with modelling the semantic information diffused through the entire text. The language being modelled is presumed to be homogeneous in topic. The training corpus, which is topically very heterogeneous, is divided into three semantic levels according to topic similarity with the target environment text, and the text on each semantic level is used to train one component of a mixture model. A document is defined as the basic, semantically homogeneous unit of the training corpus. The topic similarity between a document and a collection of target environment texts is determined by the cosine vector similarity function with the TF-IDF weighting heuristic. The crucial question for highly inflected languages is how to define terms. Here, terms are defined as clusters of words, and clustering is based on approximate string matching. We experimented with the Levenshtein distance and the Ratcliff/Obershelp similarity measure, both in combination with ending-stripping. Experiments on the Slovenian language were performed on a corpus of VEČER newswire text. The results show a significant reduction in out-of-vocabulary (OOV) rate and perplexity.
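
The term definition and topic-similarity steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The ending list, the greedy one-pass clustering, the distance threshold of 1, the smoothed IDF formula and all function names are assumptions made for illustration only. The sketch strips an ending from each word, clusters the resulting stems by Levenshtein distance to form terms, and then compares texts by cosine similarity over TF-IDF weighted term vectors.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def gestalt_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib, whose matcher is closely related
    to (not identical with) Ratcliff/Obershelp pattern matching."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical ending list for illustration; not the paper's inventory.
ENDINGS = ("ega", "emu", "ih", "om", "a", "e", "i", "o", "u")

def strip_ending(word: str, min_stem: int = 3) -> str:
    """Remove the longest listed ending that leaves >= min_stem characters."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)]
    return word

def cluster_words(words, max_dist: int = 1):
    """Greedy one-pass clustering (an assumed strategy): a word joins the
    first cluster whose representative stem is within max_dist edits."""
    reps = []      # representative stem of each cluster
    term_of = {}   # word -> cluster (term) id
    for w in sorted(set(words)):
        stem = strip_ending(w)
        for cid, rep in enumerate(reps):
            if levenshtein(stem, rep) <= max_dist:
                term_of[w] = cid
                break
        else:
            term_of[w] = len(reps)
            reps.append(stem)
    return term_of

def tfidf_vectors(docs, term_of):
    """Sparse TF-IDF vectors over term ids, using a smoothed IDF variant
    (an assumption; the abstract does not specify the exact weighting)."""
    counts = [Counter(term_of[w] for w in doc) for doc in docs]
    n = len(docs)
    df = Counter(t for c in counts for t in c)
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
             for t, tf in c.items()} for c in counts]

def cosine(u, v) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy data: a target text and two training documents (Slovenian word forms).
target = ["glasba", "glasbe"]
doc_a = ["glasbo", "mesto"]
doc_b = ["volitve", "volitev", "mesta"]
term_of = cluster_words(target + doc_a + doc_b)
vec_target, vec_a, vec_b = tfidf_vectors([target, doc_a, doc_b], term_of)
print(cosine(vec_target, vec_a), cosine(vec_target, vec_b))
```

In this toy example the inflected forms of one word fall into one term, so doc_a shares a term with the target text although the two have no surface word form in common. Ranking training documents by such a score is one plausible way to assign them to semantic levels. The thresholds separating the levels are not given in the abstract and are left out here.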