Clustering of Imperfect Transcripts Using a Novel Similarity Measure

  • Authors:
  • Oktay Ibrahimov;Ishwar K. Sethi;Nevenka Dimitrova

  • Venue:
  • Information Retrieval Techniques for Speech Applications [this book is based on the workshop “Information Retrieval Techniques for Speech Applications”, held as part of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in New Orleans, USA, in September 2001].
  • Year:
  • 2001

Abstract

In the last several years there has been a surge of interest in methods for automatically generating content indices for multimedia documents, particularly video and audio documents. As a result, there is much interest in methods for analyzing transcribed documents from audio and video broadcasts and from telephone conversations and messages. The present paper deals with such an analysis by presenting a clustering technique that partitions a set of transcribed documents into meaningful topics. Our method determines the intersection between matching transcripts, evaluates the information contribution of each transcript, assesses the information closeness of overlapping words, and calculates similarity based on the chi-square method. The main novelty of our method lies in the proposed similarity measure, which is designed to withstand the imperfections of transcribed documents. Experimental results using documents of varying transcription quality are presented to demonstrate the efficacy of the proposed methodology.
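
The abstract names the steps of the similarity computation but not its formulas, so the following Python sketch is only one possible reading of those steps and should not be taken as the authors' measure. It assumes transcripts are already tokenized into word lists, treats the "intersection" as the shared vocabulary, approximates each transcript's "information contribution" as the fraction of its tokens that fall in that shared vocabulary, and uses a standard Pearson chi-square statistic over the shared-word counts as a stand-in for the chi-square step. The function name `overlap_profile` and the returned field names are hypothetical.

```python
from collections import Counter


def overlap_profile(doc_a, doc_b):
    """Compare two tokenized transcripts over their shared vocabulary.

    Illustrative only: the contribution ratio and the chi-square
    statistic below are assumptions, not the paper's definitions.
    """
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)

    # Step 1: intersection of the two transcripts (shared words).
    shared = sorted(set(counts_a) & set(counts_b))
    if not shared:
        return {"shared": [], "contribution_a": 0.0,
                "contribution_b": 0.0, "chi_square": None}

    row_a = [counts_a[w] for w in shared]
    row_b = [counts_b[w] for w in shared]
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b

    # Step 2 (assumed): Pearson chi-square on the 2 x k table of
    # shared-word counts. A value near 0 means the two transcripts
    # use the shared words in similar proportions.
    chi2 = 0.0
    for a, b in zip(row_a, row_b):
        col = a + b
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        chi2 += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b

    # Step 3 (assumed): share of each transcript covered by the overlap.
    return {
        "shared": shared,
        "contribution_a": total_a / len(doc_a),
        "contribution_b": total_b / len(doc_b),
        "chi_square": chi2,
    }


if __name__ == "__main__":
    a = "the market fell sharply the market".split()
    b = "the market rose the market".split()
    print(overlap_profile(a, b))
```

Restricting the comparison to shared words is one plausible way to tolerate transcription errors, since misrecognized words rarely match across transcripts and so drop out of the table. A clustering step would still need a rule for combining these quantities into a single similarity score, which the abstract does not specify.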