Speech Synthesis and Recognition
Classification of emotions was conducted for Mexican Spanish. Four different sets of features were used to find the best differentiation of eight emotions taken from recordings of three poems spoken by a professional announcer. The feature sets included statistics of the fundamental frequency, the first four formants of the speech signal, the duration of pauses, and the time-frame intensity. Classification was performed with an unsupervised neural network based on a self-organized map. Considering each feature set separately, the 30 ms time-frame intensity performed best, separating subdued emotions such as sadness, contempt, and melancholy from more animated states such as happiness, anger, and derision. The results improved when the mean value of the fundamental frequency was added to the time-frame intensity: within each poem, all eight emotional states were distinguished, including an emotion defined as normal, but performance dropped when all the data were integrated into one set.
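The unsupervised approach described above can be illustrated with a minimal sketch. The paper's implementation is not available, so the following is a hypothetical 1-D self-organizing map trained on two-dimensional feature vectors (mean fundamental frequency and mean frame intensity are assumed as the features, matching the best-performing combination reported); the unit count of eight mirrors the eight emotional states, but all names, values, and parameters here are illustrative.

```python
import numpy as np

def train_som(data, n_units=8, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal 1-D self-organizing map (illustrative sketch, not the paper's code).

    data: (n_samples, n_features) array of acoustic features per utterance,
    e.g. [mean F0 in Hz, mean 30 ms frame intensity in dB].
    Returns trained unit weights, shape (n_units, n_features).
    """
    rng = np.random.default_rng(seed)
    w = rng.random((n_units, data.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                 # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 1e-9    # shrinking neighborhood
        for x in rng.permutation(data):
            bmu = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # best-matching unit
            d = np.abs(np.arange(n_units) - bmu)                 # distance on the 1-D grid
            h = np.exp(-(d ** 2) / (2 * sigma ** 2))             # neighborhood function
            w += lr * h[:, None] * (x - w)                       # pull units toward x
    return w

def classify(w, x):
    """Map a feature vector to its best-matching unit (cluster index)."""
    return int(np.argmin(np.linalg.norm(w - x, axis=1)))

# Usage with synthetic data: two hypothetical clusters standing in for a
# subdued emotion (low F0, low intensity) and an animated one (high F0,
# high intensity); real features would be extracted from the recordings.
rng = np.random.default_rng(1)
subdued = rng.normal([110.0, 55.0], 2.0, size=(20, 2))
animated = rng.normal([220.0, 75.0], 2.0, size=(20, 2))
weights = train_som(np.vstack([subdued, animated]))
```

After training, utterances from distinct emotional clusters should map to different units of the grid, which is how the map groups similar emotional states without labels.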