Modified MMI/MPE: a direct evaluation of the margin in speech recognition

Authors:
Georg Heigold;Thomas Deselaers;Ralf Schlüter;Hermann Ney
Affiliations:
RWTH Aachen University Chair of Computer Science, Aachen, Germany;RWTH Aachen University Chair of Computer Science, Aachen, Germany;RWTH Aachen University Chair of Computer Science, Aachen, Germany;RWTH Aachen University Chair of Computer Science, Aachen, Germany
Venue:
Proceedings of the 25th international conference on Machine learning
Year:
2008

Abstract

In this paper we show how common speech recognition training criteria, such as the Minimum Phone Error (MPE) criterion or the Maximum Mutual Information (MMI) criterion, can be extended to incorporate a margin term. Different margin-based training algorithms have been proposed to refine existing training algorithms for general machine learning problems. For speech recognition, however, some special problems have to be addressed, and the approaches proposed so far either lack practical applicability or require significant changes to the underlying model in order to include a margin term, e.g. to the optimization algorithm, the loss function, or the parameterization of the model. In our approach, the conventional training criteria are modified to incorporate a margin term. This allows us to do large-margin training in speech recognition using the same efficient algorithms for accumulation and optimization, and the same software, as for conventional discriminative training. We show that the proposed criteria are equivalent to Support Vector Machines with suitable smooth loss functions that approximate the non-smooth hinge loss function or the hard error (e.g. phone error). Experimental results are given for two different tasks: the rather simple digit string recognition task Sietill, which suffers severely from overfitting, and the large-vocabulary European Parliament Plenary Sessions English task, which is expected to be dominated by the risk and where generalization does not seem to be such an issue.
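
The abstract does not state the modified criteria explicitly, so the following is only a minimal sketch of one common way to add a margin term to an MMI-style criterion, and not necessarily the authors' exact formulation. It assumes a hypothetical N-best list for a single utterance, with a log-score and an error count per hypothesis. The function names and the parameters `rho` (margin scale) and `gamma` (smoothing scale) are illustrative assumptions. Under these assumptions, the sketch shows how a smooth, softmax-based loss with a margin term approaches a hinge-type loss as `gamma` grows.

```python
import math


def margin_mmi_loss(scores, costs, correct, rho=1.0, gamma=1.0):
    """Margin-augmented, MMI-style loss for one utterance (illustrative sketch).

    scores  : log-scores of the competing hypotheses (hypothetical N-best list)
    costs   : error count of each hypothesis against the reference
              (assumed to be 0 for the reference itself)
    correct : index of the reference hypothesis
    rho     : margin scale; rho = 0 gives the plain negative log-posterior
    gamma   : smoothing scale; large gamma approaches the hinge-type loss
    """
    # Competing hypotheses are boosted in proportion to their error count.
    augmented = [gamma * (s + rho * c) for s, c in zip(scores, costs)]
    # Numerically stable log-sum-exp over all hypotheses.
    m = max(augmented)
    log_z = m + math.log(sum(math.exp(a - m) for a in augmented))
    return (log_z - gamma * scores[correct]) / gamma


def hinge_loss(scores, costs, correct, rho=1.0):
    """Non-smooth counterpart: the log-sum-exp is replaced by a maximum."""
    return max(s + rho * c for s, c in zip(scores, costs)) - scores[correct]


# Toy data (made up for illustration): hypothesis 0 is the reference.
scores = [2.0, 1.5, -1.0]
costs = [0, 1, 3]

for gamma in (1.0, 5.0, 50.0):
    smooth = margin_mmi_loss(scores, costs, correct=0, rho=1.0, gamma=gamma)
    print(f"gamma={gamma:5.1f}  smooth loss={smooth:.4f}")
print(f"hinge loss={hinge_loss(scores, costs, correct=0, rho=1.0):.4f}")
```

In this sketch the smooth loss is always an upper bound on the hinge-type loss and tightens as `gamma` increases. Setting `rho = 0` removes the margin term and leaves a conventional log-posterior loss, which mirrors the abstract's point that the margin can be added without changing the surrounding training machinery.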