A New Word Clustering Method for Building N-Gram Language Models in Continuous Speech Recognition Systems

Authors:
Mohammad Bahrani;Hossein Sameti;Nazila Hafezi;Saeedeh Momtazi
Affiliations:
Speech Processing Lab, Computer Engineering Department, Sharif University of Technology, Tehran, Iran (all authors)
Venue:
IEA/AIE '08 Proceedings of the 21st international conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems: New Frontiers in Applied Artificial Intelligence
Year:
2008

Abstract

This paper presents a new method for automatic word clustering, which we use to build n-gram language models for Persian continuous speech recognition (CSR) systems. Each word is represented by a feature vector that captures the statistics of the parts of speech (POS) of that word, and the feature vectors are clustered with the k-means algorithm. The method reduces the time complexity that is a drawback of other automatic clustering methods, and it alleviates the high perplexity of manual clustering methods. The experiments are based on the "Persian Text Corpus", which contains about 9 million words. The resulting language models are evaluated by perplexity, and the results show a considerable reduction in perplexity. The word error rate of the CSR system is also reduced by about 16% compared with a manual clustering method.
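
The clustering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the feature vectors are built or how k-means is configured, so everything below is an assumption made for illustration: each word's vector holds the relative frequency of each POS tag it carries in a tagged corpus, distance is squared Euclidean, and the centroids are initialized by a deterministic farthest-first rule. The toy English words, the two-tag tagset, and the function names `pos_feature_vectors` and `kmeans` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def pos_feature_vectors(tagged_corpus, tagset):
    """Map each word to the relative frequencies of its POS tags (assumed feature)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    vectors = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        vectors[word] = [tag_counts[t] / total for t in tagset]
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means over word vectors; returns a word -> cluster id mapping."""
    words = sorted(vectors)
    # Farthest-first initialization (an assumption): start from the first word,
    # then repeatedly add the word farthest from all centroids chosen so far.
    centroids = [list(vectors[words[0]])]
    while len(centroids) < k:
        farthest = max(
            words,
            key=lambda w: min(squared_distance(vectors[w], c) for c in centroids),
        )
        centroids.append(list(vectors[farthest]))

    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach each word to its nearest centroid.
        for w in words:
            assignment[w] = min(
                range(k), key=lambda c: squared_distance(vectors[w], centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy POS-tagged corpus (hypothetical data, not from the Persian Text Corpus).
tagged_corpus = [
    ("book", "NOUN"), ("book", "NOUN"), ("book", "VERB"),
    ("table", "NOUN"), ("table", "NOUN"),
    ("run", "VERB"), ("run", "VERB"), ("run", "VERB"), ("run", "NOUN"),
    ("eat", "VERB"), ("eat", "VERB"),
]
tagset = ["NOUN", "VERB"]

vectors = pos_feature_vectors(tagged_corpus, tagset)
clusters = kmeans(vectors, k=2)
print(clusters)
```

In this sketch the mostly-noun words and the mostly-verb words end up in separate clusters. In a class-based n-gram model, the resulting cluster ids would then stand in for the individual words when n-gram statistics are collected.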