Statistical methods for speech recognition
Speech and Language Processing (2nd Edition)
Semantic spaces for improving language modeling
Computer Speech and Language
This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus containing 85 million words. Class-based models linearly interpolated with the 3-gram model reduced perplexity by up to 13% compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
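As a rough illustration of the linear interpolation and perplexity comparison mentioned above, the following minimal Python sketch mixes per-word probabilities from a word 3-gram model and a class-based model, then compares perplexities. The probability values, the weight `lam`, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper, which does not state its interpolation weights here.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linearly interpolate two probabilities assigned to the same word.

    lam is the weight of the word-based model; (1 - lam) goes to the
    class-based model. In practice lam would be tuned on held-out data.
    """
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Hypothetical per-word probabilities for a four-word test sequence.
# The word 3-gram model is assumed to be sparse: rare words get tiny mass.
word_3gram = [0.10, 0.001, 0.05, 0.002]
# The class-based model is assumed to be smoother on the rare words.
class_3gram = [0.08, 0.010, 0.04, 0.015]

lam = 0.6  # assumed interpolation weight, not a value from the paper
mixed = [interpolate(pw, pc, lam) for pw, pc in zip(word_3gram, class_3gram)]

baseline_ppl = perplexity(word_3gram)
mixed_ppl = perplexity(mixed)
relative_reduction = (baseline_ppl - mixed_ppl) / baseline_ppl
```

In this toy setup the interpolated model has lower perplexity because the class-based component lifts the probability of the rare words. The size of the reduction depends entirely on the made-up numbers and says nothing about the 13% figure reported in the abstract.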