ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
We address the data sparseness problem in language modeling (LM). Using a class LM is one way to alleviate this problem: in a class LM, infrequent words are supported by more frequent words in the same class. This paper investigates a class LM based on latent semantic analysis (LSA). A word-document matrix is usually used to represent a corpus in the LSA framework, but this matrix ignores word order within sentences. We propose several word co-occurrence matrices that preserve word order. Building on these matrices, we define a context dependent class (CDC) LM that distinguishes word classes according to their context in the sentence. Experiments on the Wall Street Journal (WSJ) corpus show that the word co-occurrence matrices outperform the word-document matrix, and that the CDC LM achieves lower perplexity than the traditional LSA-based class LM.
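To illustrate the core idea, the sketch below builds one possible order-preserving word co-occurrence matrix (counting which word immediately precedes which) and applies SVD to obtain low-dimensional word representations, as in LSA. This is a hedged illustration on a toy corpus, not the paper's exact construction: the specific matrices, the WSJ preprocessing, and the clustering into classes are assumptions here.

```python
import numpy as np

# Toy corpus standing in for WSJ (assumption: whitespace tokenization).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Build the vocabulary and an index for each word.
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# One order-preserving word co-occurrence matrix:
# C[i, j] counts how often word j immediately precedes word i.
# (Unlike a word-document matrix, this keeps local word order.)
C = np.zeros((V, V))
for s in corpus:
    toks = s.split()
    for prev, cur in zip(toks, toks[1:]):
        C[idx[cur], idx[prev]] += 1

# LSA step: truncated SVD gives low-rank word representations,
# which could then be clustered into classes for a class LM.
U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 2  # assumed number of latent dimensions for this toy example
word_vectors = U[:, :k] * S[:k]

print(C[idx["sat"], idx["cat"]])   # "cat sat" occurs once
print(word_vectors.shape)
```

Because rows are conditioned on the preceding word, words with similar left contexts (here "cat" and "dog") end up with similar rows, which is the property a context dependent class assignment can exploit.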