Topic-based mixture language modelling

Authors:
Yoshihiko Gotoh;Steve Renals
Affiliations:
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK; e-mail: y.gotoh@dcs.shef.ac.uk, s.renals@dcs.shef.ac.uk
Venue:
Natural Language Engineering
Year:
1999

Abstract

This paper describes an approach to constructing a mixture of language models based on simple statistical notions of semantics, using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and can model varying styles of text. Using this information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is created automatically. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling. Manual and automatic clustering are compared in order to elucidate how global content information is expressed in the space. We also compare alternative term-weighting schemes, and the effect of dimension reduction by singular value decomposition (latent semantic analysis), in terms of association with manual clustering and language modelling accuracy. Test set perplexity results on the British National Corpus indicate that the approach can improve statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data at a slight increase in computational cost.
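The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the term-weighting schemes, the clustering algorithm, the component models, or the adaptive procedure, so the choices below are all assumptions made for illustration: tf-idf weighting, plain k-means, add-one-smoothed unigram components, and EM re-estimation of the mixture weights. The toy documents and every function name are hypothetical.

```python
import numpy as np

def tfidf(docs, vocab):
    """Document-by-term matrix weighted by tf * log(N / df) (assumed scheme)."""
    col = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for w in doc:
            tf[i, col[w]] += 1.0
    df = np.count_nonzero(tf, axis=0)  # number of documents containing each term
    return tf * np.log(len(docs) / df)

def lsa(x, dims):
    """Project documents onto the top `dims` singular directions."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :dims] * s[:dims]

def kmeans(x, k, iters=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centres = [x[0]]
    while len(centres) < k:
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centres], axis=0)
        centres.append(x[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster empties
                centres[j] = x[labels == j].mean(axis=0)
    return labels

def unigram_models(docs, labels, vocab, k):
    """One add-one-smoothed unigram distribution per cluster (rows sum to 1)."""
    col = {w: j for j, w in enumerate(vocab)}
    counts = np.ones((k, len(vocab)))
    for doc, lab in zip(docs, labels):
        for w in doc:
            counts[lab, col[w]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def em_weights(models, ids, iters=10):
    """Re-estimate mixture weights on a text by EM, starting from uniform."""
    k = models.shape[0]
    lam = np.full(k, 1.0 / k)
    probs = models[:, ids]                        # shape (k, number of tokens)
    for _ in range(iters):
        post = lam[:, None] * probs               # E-step: component posteriors
        post /= post.sum(axis=0, keepdims=True)
        lam = post.mean(axis=1)                   # M-step: average posterior
    return lam

def perplexity(models, lam, ids):
    """Perplexity of the mixture sum_k lam[k] * P_k(w) on a token sequence."""
    p = lam @ models[:, ids]
    return float(np.exp(-np.mean(np.log(p))))

# Toy corpus with two obvious topics; all words are in-vocabulary by construction.
docs = [d.split() for d in [
    "match goal team", "team goal score", "score match team",
    "bank market share", "share price bank", "market price share"]]
vocab = sorted({w for d in docs for w in d})
labels = kmeans(lsa(tfidf(docs, vocab), 2), 2)
models = unigram_models(docs, labels, vocab, 2)

col = {w: j for j, w in enumerate(vocab)}
ids = [col[w] for w in "bank share price market".split()]
uniform = np.full(2, 0.5)
tuned = em_weights(models, ids)
print(labels, tuned)
print(perplexity(models, uniform, ids), perplexity(models, tuned, ids))
```

Because each EM step cannot decrease the likelihood of the text it is run on, the tuned weights give a perplexity on that text no higher than the uniform weights. A realistic system would use n-gram components, handle out-of-vocabulary words, and evaluate on held-out data, none of which this sketch attempts.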