Speaker Identification Using Instantaneous Frequencies

  • Authors:
  • M. Grimaldi; F. Cummins

  • Affiliations:
  • Sch. of Comput. Sci. & Inf., Univ. Coll. Dublin, Dublin

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Year:
  • 2008

Abstract

This paper presents an experimental evaluation of different features for use in speaker identification. The features are tested on speech data from the CHAINS corpus in a closed-set speaker identification task. The main objective of the paper is to present a novel parametrization of speech based on the AM-FM representation of the speech signal and to assess the utility of these features for speaker identification. To explore the extent to which the different instantaneous frequencies arising from formants and harmonics in the speech signal may predict a speaker's identity, this work evaluates three decompositions of the speech signal within the same AM-FM framework: the first setup has been used previously for formant tracking, the second is designed to enhance familiar resonances below 4000 Hz, and the third is designed to approximate the bandwidth scaling of the filters conventionally used in the extraction of Mel-frequency cepstral coefficients (MFCCs). From each of the proposed setups, parameters are extracted and used in a closed-set, text-independent speaker identification task. The performance of the new feature representation is compared with results obtained using MFCC and RASTA-PLP features within a generic Gaussian mixture model (GMM) classification system. In evaluating the novel features, we look selectively at the information for speaker identification contained in the frequency ranges 0-4000 Hz and 4000-8000 Hz, as the instantaneous frequencies revealed by the AM-FM approach suggest the presence of structures not well known from conventional spectrographic analyses. The accuracy obtained with the new parametrization is as good as that of conventional MFCC parameters within the same reference system when trained and tested on modally voiced speech that is mismatched in both channel and style.
When the testing material is whispered speech, the new parameters provide better results than any of the other features tested, although they remain far from ideal in this limiting case.
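
As an illustration of what an "instantaneous frequency" means in an AM-FM representation, the following minimal sketch demodulates a single band-limited component. The abstract does not say how the paper estimates instantaneous frequencies or how its filterbanks are built, so everything here is an assumption: the analytic-signal (Hilbert transform) approach, the function names `analytic_signal` and `instantaneous_frequency`, and the synthetic test tone are all hypothetical choices, not the authors' method.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of a real 1-D array (FFT method; assumed approach)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist), double positive frequencies, zero negative ones.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def instantaneous_frequency(x, sample_rate):
    """Instantaneous frequency in Hz from the unwrapped analytic phase."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    # Phase increment per sample, converted from radians/sample to Hz.
    return np.diff(phase) * sample_rate / (2.0 * np.pi)

# Synthetic AM-FM component (hypothetical values): a 1000 Hz carrier whose
# frequency swings by +/-50 Hz at 5 Hz, with a slow amplitude modulation.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate  # one second of samples
amplitude = 1.0 + 0.3 * np.cos(2 * np.pi * 3 * t)
x = amplitude * np.cos(2 * np.pi * 1000 * t + 10.0 * np.sin(2 * np.pi * 5 * t))

freq = instantaneous_frequency(x, sample_rate)   # tracks 1000 + 50*cos(2*pi*5*t)
envelope = np.abs(analytic_signal(x))            # tracks the AM envelope
```

For real speech, a signal would first be split into narrow bands (for example around formants) and each band demodulated separately; the band layout is exactly what the three setups described in the abstract vary, and it is not reproduced here.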