Numerical recipes in C: the art of scientific computing
Numerical recipes in C: the art of scientific computing
A Cache-Based Natural Language Model for Speech Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence
Information retrieval: data structures and algorithms
Information retrieval: data structures and algorithms
Class-based n-gram models of natural language
Computational Linguistics
Projections for efficient document clustering
Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Distributional clustering of words for text classification
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Mixtures of probabilistic principal component analyzers
Neural Computation
Probabilistic latent semantic indexing
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Effective information retrieval using term accuracy
Communications of the ACM
Information Retrieval
Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache
ICASSP '97 Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97)-Volume 2 - Volume 2
Statistical Models for Co-occurrence Data
Statistical Models for Co-occurrence Data
SVDPACKC (Version 1.0) User''s Guide
SVDPACKC (Version 1.0) User''s Guide
Distributional clustering of English words
ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
A novel word clustering algorithm based on latent semantic analysis
ICASSP '96 Proceedings of the Acoustics, Speech, and Signal Processing, 1996. on Conference Proceedings., 1996 IEEE International Conference - Volume 01
ICASSP '99 Proceedings of the Acoustics, Speech, and Signal Processing, 1999. on 1999 IEEE International Conference - Volume 01
Speech recognition experiments using multi-span statistical language models
ICASSP '99 Proceedings of the Acoustics, Speech, and Signal Processing, 1999. on 1999 IEEE International Conference - Volume 02
Modeling the manifolds of images of handwritten digits
IEEE Transactions on Neural Networks
Proceedings of the 43rd annual Southeast regional conference - Volume 1
Hi-index | 0.00 |
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling. A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost.