A unified framework for domain independent online speaker indexing in eigen-voice space using an index tree of reference models

  • Authors:
  • M. H. Moattar; M. M. Homayounpour

  • Affiliations:
  • Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran; Laboratory for Intelligent Sound and Speech Processing, Computer Engineering and Information Technology Dept., Amirkabir University of Technology, Tehran, Iran

  • Venue:
  • International Journal of Speech Technology
  • Year:
  • 2013

Abstract

Speaker indexing, referred to in the literature as speaker diarization, is an important task in audio indexing and retrieval. It comprises two important and usually separate stages, namely speaker segmentation and speaker clustering, and can be performed either online or offline. This paper focuses mainly on domain-independent online speaker indexing. For this purpose, the proposed framework should be parameter-free, requiring no application-specific parameters such as utterance duration or threshold settings. To reduce dependency on parameters, traditional speaker segmentation is reformulated as a voting-based homogeneous speech segmentation, in which several approaches are applied in parallel to decide whether a change point exists. In online indexing, data insufficiency is encountered at each time slice. In the proposed framework, a set of reference speaker models is used as side information to facilitate online tracking. To improve indexing accuracy, adaptation approaches in the eigen-voice decomposition space are proposed. To reduce the computational cost of tracking, an index structure over the reference models is proposed to speed up the search in the model space. The proposed framework is evaluated on the 2002 Rich Transcription Broadcast News and Conversational Telephone Speech Corpus (Garofolo, NIST Rich Transcription, 2002) as well as on a synthetic dataset. The indexing errors of the proposed framework on telephone conversations, broadcast news and the synthetic dataset are 7.51%, 6.36% and 9.34%, respectively. In addition, using the index tree structure, the tracking run time of the proposed framework is improved by 32%.
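
The voting idea mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which detectors the authors combine or how votes are counted, so everything here is an assumption made for illustration: the three toy detectors (`mean_shift`, `range_disjoint`, `variance_change`), the one-dimensional features, the strict-majority rule, and the variance ratio of 2.0 are all hypothetical and are not the paper's method.

```python
from statistics import mean, pstdev, pvariance
from typing import Callable, Sequence

# A detector looks at the feature values on both sides of a candidate
# boundary and votes True if it believes the speaker changed.
Detector = Callable[[Sequence[float], Sequence[float]], bool]


def mean_shift(left: Sequence[float], right: Sequence[float]) -> bool:
    # Hypothetical detector: the means differ by more than the pooled spread.
    return abs(mean(left) - mean(right)) > pstdev(list(left) + list(right))


def range_disjoint(left: Sequence[float], right: Sequence[float]) -> bool:
    # Hypothetical detector: the two windows do not overlap in value range.
    return max(left) < min(right) or max(right) < min(left)


def variance_change(left: Sequence[float], right: Sequence[float]) -> bool:
    # Hypothetical detector: one window is much more variable than the other.
    # The ratio 2.0 is an arbitrary illustrative choice.
    hi, lo = sorted((pvariance(left), pvariance(right)), reverse=True)
    return lo > 0 and hi / lo > 2.0


def is_change_point(left: Sequence[float], right: Sequence[float],
                    detectors: Sequence[Detector]) -> bool:
    # Strict majority vote over all detectors run on the same boundary.
    votes = sum(1 for detect in detectors if detect(left, right))
    return 2 * votes > len(detectors)


DETECTORS = [mean_shift, range_disjoint, variance_change]

same_speaker = is_change_point([0.0, 0.1, -0.1, 0.05],
                               [0.05, -0.05, 0.1, 0.0], DETECTORS)
new_speaker = is_change_point([0.0, 0.1, -0.1, 0.05],
                              [5.0, 5.1, 4.9, 5.05], DETECTORS)
```

In this toy setup no single detector decides the outcome: in the second call the variance detector votes against a change, yet the boundary is still accepted because the other two agree.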