A segment selection technique for speaker verification

  • Authors:
  • Mohaddeseh Nosratighods; Eliathamby Ambikairajah; Julien Epps; Michael John Carey

  • Affiliations:
  • School of Electrical Engineering and Telecommunications, UNSW, Sydney, NSW 2052, Australia
  • School of Electrical Engineering and Telecommunications, UNSW, Sydney, NSW 2052, Australia and National ICT Australia (NICTA), Australian Technology Park, Eveleigh 1430, Australia
  • School of Electrical Engineering and Telecommunications, UNSW, Sydney, NSW 2052, Australia
  • Department of Electronic, Electrical and Computer Engineering, The University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom

  • Venue:
  • Speech Communication
  • Year:
  • 2010

Abstract

The performance of speaker verification systems degrades considerably when the test segments are utterances of very short duration. This may be due either to variations in score matching arising from speech sounds that go unobserved in short utterances, or to the fact that the shorter the utterance, the greater the effect of individual speech sounds on the average likelihood score. In other words, in very short utterances the effects of individual speech sounds are not averaged out over a large number of speech sounds. This paper presents a score-based segment selection technique for discarding portions of speech that result in poor discrimination ability in a speaker verification task. Theory is developed to detect the most significant and reliable speech segments based on the probability that the test segment comes from a fixed set of cohort models. This approach, suitable for test utterances of any duration, reduces the effect of acoustic regions of the speech that are not accurately modelled due to sparse training data, and makes a decision based only on the segments that provide the best-matched scores from the segment selection algorithm. The proposed segment selection technique provides relative error rate reductions of 22% and 7% in terms of minimum Detection Cost Function (DCF) and Equal Error Rate (EER), respectively, compared with a baseline using segment-based normalization, when evaluated on the short utterances of the NIST 2002 Speaker Recognition Evaluation dataset.
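The abstract describes scoring only the best-matched segments of a test utterance instead of averaging the likelihood score over all of it. The sketch below illustrates one way such segment selection could be wired into a GMM-based verifier; it is not the authors' exact algorithm. The segment length, the fraction of segments kept, the use of scikit-learn GaussianMixture models as stand-ins for the target and cohort speaker models, and the choice to rank segments by their target-versus-best-cohort log-likelihood ratio are all illustrative assumptions.

```python
# Illustrative sketch only: segment-level scoring with selection of the
# best-matched segments. GaussianMixture models stand in for the target
# and cohort speaker models; this is not the paper's exact method.

import numpy as np
from sklearn.mixture import GaussianMixture


def segment_scores(features, target_gmm, cohort_gmms, seg_len=50):
    """Per-segment log-likelihood ratio: target model vs. closest cohort model."""
    n_segs = len(features) // seg_len
    scores = []
    for i in range(n_segs):
        seg = features[i * seg_len:(i + 1) * seg_len]
        target_ll = target_gmm.score(seg)                   # mean frame log-likelihood
        cohort_ll = max(g.score(seg) for g in cohort_gmms)  # best-matched cohort model
        scores.append(target_ll - cohort_ll)
    return np.array(scores)


def selected_utterance_score(features, target_gmm, cohort_gmms,
                             seg_len=50, keep_frac=0.7):
    """Average the segment LLRs over only the best-scoring fraction of
    segments, discarding those assumed to be poorly modelled."""
    scores = segment_scores(features, target_gmm, cohort_gmms, seg_len)
    n_keep = max(1, int(np.ceil(keep_frac * len(scores))))
    return float(np.mean(np.sort(scores)[::-1][:n_keep]))


if __name__ == "__main__":
    # Synthetic demonstration with random 20-dimensional "features".
    rng = np.random.default_rng(0)
    target = GaussianMixture(n_components=4, random_state=0).fit(
        rng.normal(size=(2000, 20)))
    cohort = [GaussianMixture(n_components=4, random_state=0).fit(
        rng.normal(loc=m, size=(2000, 20))) for m in (0.5, -0.5)]
    test_feats = rng.normal(size=(500, 20))
    print(selected_utterance_score(test_feats, target, cohort))
```

In this simplified reading, discarding the lowest-ranked segments plays the role of removing acoustic regions that are poorly modelled; the paper's actual selection criterion is based on the probability that a segment comes from the fixed cohort set, and the final score is computed only over the retained segments.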