A Cache-Based Natural Language Model for Speech Recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence
Class-based n-gram models of natural language
Computational Linguistics
Testing the correlation of word error rate and perplexity
Speech Communication
Latent Dirichlet Allocation
The Journal of Machine Learning Research
Topic identification in discourse
EACL '95: Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics
Factored language models and generalized parallel backoff
NAACL-Short '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Companion Volume: Short Papers - Volume 2
Language modeling with sentence-level mixtures
HLT '94: Proceedings of the Workshop on Human Language Technology
A novel word clustering algorithm based on latent semantic analysis
ICASSP '96: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
Word Topic Models for Spoken Document Retrieval and Transcription
ACM Transactions on Asian Language Information Processing (TALIP)
Probabilistic latent semantic analysis
UAI '99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence
Language models (LMs) are an important component of automatic speech recognition (ASR) systems: an LM guides the acoustic model toward the word sequence that best matches a given speech signal. Without one, an ASR system has no knowledge of the language, and recovering the correct word sequence becomes difficult. In recent years, researchers have tried to incorporate long-range dependencies into statistical word-based n-gram LMs. One such long-range dependency is topic. Unlike words, a topic is not directly observable, so the meaning behind the words must be inferred in order to identify it. This work is based on the premise that nouns carry topic information. We propose a new approach to topic-dependent language modeling in which the topic is determined in an unsupervised manner. Latent Semantic Analysis (LSA) is employed to reveal hidden (latent) relations among the nouns in the context words. To decide the topic of an event, a fixed-size window of the word history is observed, and voting is carried out over noun-class occurrences weighted by a confidence measure. Experiments were conducted on an English corpus and a Japanese corpus: the Wall Street Journal corpus and the Mainichi Shimbun (Japanese newspaper) corpus. The results show that the proposed method yields lower perplexity than the comparative baselines, including word-based and class-based n-gram LMs, their interpolation, a cache-based LM, an n-gram-based topic-dependent LM, and a topic-dependent LM based on Latent Dirichlet Allocation (LDA). N-best list rescoring was also conducted to validate the method's applicability to ASR systems.
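The windowed voting step admits a compact illustration. The following is a minimal sketch, not the authors' implementation: it assumes the LSA-derived noun classes and the per-noun confidence weights have already been computed, and the function name vote_topic, the dictionary layout, the window size, and all toy values are illustrative.

from collections import defaultdict

def vote_topic(history, noun_class, confidence, window=20):
    # Decide the topic of the next event from a fixed-size history window:
    # each noun in the window votes for its LSA-derived class, and each
    # vote is weighted by that noun's confidence score.
    votes = defaultdict(float)
    for word in history[-window:]:
        if word in noun_class:                       # only nouns carry topic votes
            votes[noun_class[word]] += confidence.get(word, 0.0)
    if not votes:
        return None                                  # no nouns in the window
    return max(votes, key=votes.get)

# Toy example (classes and weights are hypothetical): finance nouns
# outvote sports nouns, so class 0 is chosen as the topic.
noun_class = {"stock": 0, "market": 0, "bond": 0, "goal": 1, "team": 1}
confidence = {"stock": 0.9, "market": 0.8, "bond": 0.7, "goal": 0.9, "team": 0.6}
history = ["the", "stock", "market", "fell", "as", "bond", "yields", "rose"]
print(vote_topic(history, noun_class, confidence))   # -> 0

In the paper's setting, the winning class would presumably select the topic-dependent component of the LM, while the no-noun case would fall back on a topic-independent model.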