A hybrid language model based on a combination of N-grams and stochastic context-free grammars

Authors:
Diego Linares;José-Miguel Benedí;Joan-Andreu Sánchez
Affiliations:
Pontificia Universidad Javeriana, Cali, Colombia;Universidad Politécnica de Valencia, Valencia, Spain;Universidad Politécnica de Valencia, Valencia, Spain
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2004

Abstract

This paper defines a hybrid language model that combines a word-based n-gram, which captures local relations between words, with a category-based stochastic context-free grammar (SCFG) and a distribution of words into categories, which together represent long-term relations between those categories. It studies the unsupervised learning of an SCFG in General Format and in Chomsky Normal Form by means of estimation algorithms, and proposes a bracketed version of the classical estimation algorithm based on the Earley algorithm. It also explores the use of SCFGs obtained from a treebank corpus as initial models for the estimation algorithms. Experiments on the UPenn Treebank corpus are reported, evaluated by test-set perplexity and by word error rate in a speech recognition experiment.
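
The abstract does not say how the two components are combined or how perplexity is computed. As a minimal sketch, assuming a linear interpolation with a weight `alpha` (an assumption made here for illustration, not a detail taken from the abstract), the hybrid probability of each word and the resulting test-set perplexity could be computed as follows. The per-word probabilities are made-up numbers, and the names `interpolate`, `perplexity`, `ngram_probs` and `scfg_probs` are hypothetical. The SCFG component is treated as a black box that already supplies a conditional probability for each word.

```python
import math


def interpolate(p_ngram: float, p_scfg: float, alpha: float) -> float:
    """Linearly interpolate two conditional word probabilities.

    alpha weights the n-gram term; (1 - alpha) weights the SCFG term.
    This combination rule is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_ngram + (1.0 - alpha) * p_scfg


def perplexity(word_probs) -> float:
    """Perplexity: exp of the negative mean log-probability per word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical per-word probabilities for a four-word test sentence.
ngram_probs = [0.20, 0.05, 0.10, 0.30]
scfg_probs = [0.10, 0.15, 0.02, 0.25]
alpha = 0.7  # illustrative weight; in practice it would be tuned on held-out data

hybrid_probs = [
    interpolate(pn, ps, alpha) for pn, ps in zip(ngram_probs, scfg_probs)
]
print(perplexity(hybrid_probs))
```

Lower perplexity means the model assigns higher probability to the test words. Setting `alpha` to 1 reduces the sketch to the n-gram alone, which gives a baseline for judging what the grammar component adds.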