Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study

Authors:
Beth Ann Hockey;Manny Rayner;Gwen Christian
Affiliations:
NASA Ames Research Center, UCSC UARC, Mail Stop 19-26, Moffett Field CA 94035;University of Geneva, TIM/ISSCO, Geneva 4, Switzerland CH-1211;Dept of Linguistics, UC Santa Cruz
Venue:
GoTAL '08 Proceedings of the 6th international conference on Advances in Natural Language Processing
Year:
2008

Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, while grammar-based language models (GLMs) have the advantage that they can be built even when little corpus data is available. A known way of attempting to combine the two methodologies is first to create a GLM and then to use that GLM to generate training data for an SLM. It has, however, been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the Open Source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly and to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to generate the training data is critical, with PCFG generation heavily outscoring CFG generation.
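
To make the contrast between PCFG and CFG generation concrete, the following is a minimal sketch and not the Regulus pipeline used in the paper. It assumes a hypothetical toy grammar (`TOY_GRAMMAR`), treats "CFG generation" as choosing uniformly among a nonterminal's rules, and stands in for SLM training with plain bigram counts. All names, rules, and probabilities are invented for illustration.

```python
import random
from collections import Counter

# Hypothetical toy grammar (an assumption, not from the paper):
# nonterminal -> list of (right-hand side, rule probability).
TOY_GRAMMAR = {
    "S": [(["switch", "on", "NP"], 0.7), (["switch", "off", "NP"], 0.3)],
    "NP": [(["the", "N"], 0.9), (["the", "N", "and", "NP"], 0.1)],
    "N": [(["light"], 0.8), (["fan"], 0.2)],
}


def generate(grammar, symbol, rng, use_probs):
    """Expand `symbol` top-down into a list of words.

    use_probs=True samples rules by their probabilities (PCFG generation);
    use_probs=False picks uniformly among the rules (plain CFG generation).
    """
    if symbol not in grammar:  # terminal symbol: emit the word itself
        return [symbol]
    rules = grammar[symbol]
    weights = [p for _, p in rules] if use_probs else [1.0] * len(rules)
    rhs = rng.choices([r for r, _ in rules], weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(grammar, child, rng, use_probs))
    return words


def make_corpus(grammar, n, seed, use_probs):
    """Generate n sentences from the start symbol "S"."""
    rng = random.Random(seed)
    return [generate(grammar, "S", rng, use_probs) for _ in range(n)]


def bigram_counts(sentences):
    """Count bigrams with sentence-boundary markers (a stand-in for SLM training)."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


pcfg_corpus = make_corpus(TOY_GRAMMAR, 2000, seed=0, use_probs=True)
cfg_corpus = make_corpus(TOY_GRAMMAR, 2000, seed=0, use_probs=False)
pcfg_bigrams = bigram_counts(pcfg_corpus)
cfg_bigrams = bigram_counts(cfg_corpus)

# Compare how often each corpus contains a bigram the grammar favours.
print("switch on (PCFG):", pcfg_bigrams[("switch", "on")])
print("switch on (CFG): ", cfg_bigrams[("switch", "on")])
```

In this sketch the PCFG corpus reproduces the skew encoded in the rule probabilities, while the uniform CFG corpus flattens it and over-generates the recursive coordination rule. This illustrates one way the generation method can affect the data an SLM is trained on; it does not reproduce the paper's results.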