Class-based n-gram models of natural language
Computational Linguistics
Introduction to Evolutionary Computing
Introduction to Evolutionary Computing
Hi-index | 0.00 |
Bayesian classification using Markov model analysis of token strings is used in many areas such as computational linguistics, speech recognition, and bioinformatics. Unfortunately, for many problems, the available data sets are too small to accurately estimate the large number of parameters in a Markov model. In our work, we explore the possibility of using string space transformations to reduce the perplexity of the modeling problem and thereby improve model performance. The set of all possible string-to-string transformation functions is very large. By using a genetic algorithm to search for transformation functions that improve the performance of a Markov-based classifier, we are able to construct a classifier system that performs better than the Markov classifier alone. We go on to demonstrate the improved performance on the problem of classifying English and Spanish character strings, where training set size is arbitrarily limited.