Detection of speaker individual information using a phoneme effect suppression method

  • Authors:
  • Songgun Hyon;Jianwu Dang;Hui Feng;Hongcui Wang;Kiyoshi Honda

  • Affiliations:
School of Computer Science and Technology, Tianjin University, China and School of Computer Science, Kim Il Sung University, Democratic People's Republic of Korea;School of Computer Science and Technology, Tianjin University, China and School of Information Science, Japan Advanced Institute of Science and Technology, Japan and Tianjin Key Laboratory of Cognitive Computing and Application, China;Tianjin Key Laboratory of Cognitive Computing and Application, China and School of Liberal Arts and Law, Tianjin University, China;School of Computer Science and Technology, Tianjin University, China and Tianjin Key Laboratory of Cognitive Computing and Application, China;School of Computer Science and Technology, Tianjin University, China and Tianjin Key Laboratory of Cognitive Computing and Application, China

  • Venue:
  • Speech Communication
  • Year:
  • 2014


Abstract

Feature extraction of speaker information from speech signals is a key step in exploring individual speaker characteristics and the most critical component of a speaker recognition system: it must preserve individual information while attenuating linguistic information. However, the two kinds of information are difficult to separate in a given utterance. For this reason, we investigated a number of potential effects on speaker individual information that arise from differences in articulation due to speaker-specific morphology of the speech organs, comparing English, Chinese, and Korean. We found that voiced and unvoiced phonemes show different frequency distributions of speaker information, and that these effects are consistent across the three languages, whereas the effect of nasal sounds on speaker individuality is language dependent. Because these phoneme-related differences are confounded with speaker individual information, they degrade feature extraction. Accordingly, we propose a new feature extraction method that detects speaker individual information more accurately by suppressing phoneme-related effects: phoneme alignment is required only once, when constructing the filter bank for phoneme effect suppression, and is not needed during feature extraction itself. The proposed method was evaluated by implementing it in GMM speaker models for speaker identification experiments, where it outperformed both Mel-Frequency Cepstral Coefficients (MFCC) and the traditional F-ratio-based features (FFCC). The proposed feature reduced recognition errors by 32.1-67.3% across the three languages compared with MFCC, and by 6.6-31% compared with FFCC. When combined with an automatic phoneme aligner, the proposed method detected speaker individuality with about the same accuracy as with manual phoneme alignment.
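The F-ratio criterion underlying FFCC-style features can be sketched as follows: for each frequency band, the ratio of between-speaker variance to within-speaker variance indicates how much speaker information that band carries, and these ratios can then weight a filter bank. This is a minimal illustrative sketch, not the paper's implementation; the function name and the toy data layout are assumptions.

```python
import numpy as np

def f_ratio(band_energies, speaker_ids):
    """Per-band F-ratio: between-speaker variance over within-speaker variance.

    band_energies: (n_frames, n_bands) array, e.g. log filter-bank energies
    speaker_ids:   length-n_frames array of speaker labels per frame
    Returns a length-n_bands array; higher values mean the band carries
    more speaker-discriminative information.
    """
    band_energies = np.asarray(band_energies, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    grand_mean = band_energies.mean(axis=0)
    between = np.zeros(band_energies.shape[1])
    within = np.zeros(band_energies.shape[1])
    for s in np.unique(speaker_ids):
        frames = band_energies[speaker_ids == s]  # all frames of speaker s
        mean_s = frames.mean(axis=0)
        between += len(frames) * (mean_s - grand_mean) ** 2
        within += ((frames - mean_s) ** 2).sum(axis=0)
    return between / within

# Toy usage: two speakers whose spectra differ only in band 0,
# so band 0 should receive a much larger F-ratio than band 1.
rng = np.random.default_rng(0)
ids = np.array([0] * 50 + [1] * 50)
energies = rng.normal(0.0, 0.1, size=(100, 2))
energies[ids == 0, 0] += 5.0  # speaker-dependent offset in band 0 only
ratios = f_ratio(energies, ids)
```

In an F-ratio-weighted front end, `ratios` (suitably normalized) would scale the corresponding filter-bank channels before the cepstral transform, emphasizing speaker-informative bands; the paper's contribution goes further by additionally suppressing phoneme-related variance when building the filter bank.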