Korean large vocabulary continuous speech recognition with morpheme-based recognition units

  • Authors:
  • Oh-Wook Kwon; Jun Park

  • Affiliations:
  • Brain Science Research Center, KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, South Korea; Spoken Language Processing Team, ETRI, 161 Gajeong-dong, Yuseong-gu, Daejeon 305-350, South Korea

  • Venue:
  • Speech Communication
  • Year:
  • 2003

Abstract

In written Korean, a space is placed between two adjacent word-phrases, each of which roughly corresponds, in a semantic sense, to two or three English words. If the word-phrase is used as the recognition unit for Korean large vocabulary continuous speech recognition (LVCSR), the out-of-vocabulary (OOV) rate becomes very large. If a morpheme or a syllable is used instead, a severe inter-morpheme coarticulation problem arises because many morphemes are short. We propose to use merged morphemes as recognition units together with pronunciation-dependent entries in the language model (LM), so that these difficulties are reduced and the between-word phonology rules of Korean can be incorporated into the decoding algorithm of a Korean LVCSR system. Starting from the original morpheme units defined by Korean morphology, we merge pairs of short and frequent morphemes into larger units using a rule-based method and a statistical method. We define the merged morpheme unit as a word and use it as the recognition unit. The system was evaluated on two business-related tasks: a read speech recognition task and a broadcast news transcription task. In both tasks, the OOV rate was reduced to a level comparable to that of American English. In the read speech recognition task, with a 32k vocabulary and a word-based trigram LM estimated from a newspaper text corpus, the word error rate (WER) of the baseline system was reduced from 25.0% to 20.0% by cross-word modeling and pronunciation-dependent language modeling, and further to 15.5% by enlarging the speech database and text corpora. In the broadcast news transcription task, the statistical merging method reduced the WER of the baseline system without morpheme merging by 3.4% relative, and both proposed merging methods yielded similar performance. Applying all the proposed techniques, we achieved a WER of 17.6% for clean speech and 27.7% for noisy speech.
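The abstract does not spell out the statistical merging procedure, but one plausible reading is an iterative, frequency-driven merge of adjacent short morphemes, broadly similar to byte-pair-style unit merging. The sketch below only illustrates that idea under stated assumptions: the corpus is represented as lists of morpheme strings, and the function name merge_frequent_morpheme_pairs and the parameters max_len, min_count, and num_merges are hypothetical, not taken from the paper.

    from collections import Counter

    def merge_frequent_morpheme_pairs(corpus, max_len=4, min_count=1000, num_merges=500):
        """Greedily merge frequent pairs of adjacent short morphemes into larger units.

        corpus: list of sentences, each a list of morpheme strings.
        Returns a rewritten corpus in which merged units are joined with '+'.
        """
        for _ in range(num_merges):
            # Count adjacent pairs whose members are both short morphemes.
            pair_counts = Counter(
                (a, b)
                for sent in corpus
                for a, b in zip(sent, sent[1:])
                if len(a) <= max_len and len(b) <= max_len
            )
            if not pair_counts:
                break
            (a, b), count = pair_counts.most_common(1)[0]
            if count < min_count:
                break  # stop when no remaining pair is frequent enough to merge
            merged = a + "+" + b  # '+' marks the morpheme boundary inside the new unit
            rewritten = []
            for sent in corpus:
                out, i = [], 0
                while i < len(sent):
                    if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
                        out.append(merged)  # replace the chosen pair with the merged unit
                        i += 2
                    else:
                        out.append(sent[i])
                        i += 1
                rewritten.append(out)
            corpus = rewritten
        return corpus

    # Toy usage with made-up romanized morphemes (illustrative only):
    sents = [["hak", "gyo", "e", "gan", "da"]] * 3
    print(merge_frequent_morpheme_pairs(sents, max_len=3, min_count=2, num_merges=10))

The resulting merged units can then be treated as the lexicon entries and LM tokens of the recognizer; the paper's actual merge criterion, thresholds, and handling of pronunciation-dependent LM entries are described in the full text rather than in the abstract.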