Extractive speech summarization using shallow rhetorical structure modeling

  • Authors: Justin Jian Zhang; Ricky Ho Yin Chan; Pascale Fung

  • Affiliations: Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (all three authors)

  • Venue: IEEE Transactions on Audio, Speech, and Language Processing
  • Year: 2010

Abstract

We propose an extractive approach to speech summarization built on a novel shallow rhetorical structure learning framework. One of the most under-utilized features in extractive summarization is hierarchical structure information: the semantically cohesive units that are hidden in spoken documents. We first present empirical evidence that rhetorical structure is the underlying semantic information, which is rendered in linguistic and acoustic/prosodic forms in lecture speech. To test this hypothesis, we first propose a segmental summarization method in which the document is partitioned into rhetorical units by K-means clustering. We show that this system produces summaries at 67.36% ROUGE-L F-measure, a 4.29% absolute increase over the baseline system. We then propose Rhetorical-State Hidden Markov Models (RSHMMs) to automatically decode the underlying hierarchical rhetorical structure in speech. Tenfold cross-validation experiments are carried out on conference speeches. We show that the RSHMM-based system gives a 71.31% ROUGE-L F-measure, an 8.24% absolute increase in lecture speech summarization performance over the baseline system without RSHMMs. Our method also outperforms the baseline with a conventional discourse feature. Finally, we present a thorough investigation of the relative contribution of different features and show that, for lecture speech, speaker-normalized acoustic features contribute the most at 68.5% ROUGE-L F-measure, compared with 62.9% for linguistic features and 59.2% for un-normalized acoustic features. This shows that the individual speaking style of each speaker is highly relevant to summarization.
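
All results above are reported as ROUGE-L F-measure, which scores a candidate summary by the longest common subsequence (LCS) of words it shares with a reference summary. The following is a minimal Python sketch of that metric for a single candidate/reference pair. It assumes whitespace tokenization and a balanced F-measure (`beta=1.0`); the function names and these settings are illustrative assumptions, not the evaluation configuration used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate, reference, beta=1.0):
    """LCS-based F-measure between two strings.

    Assumption: whitespace tokenization and beta=1.0 (balanced F);
    the paper's exact ROUGE settings are not given in the abstract.
    """
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    score = rouge_l_f("the cat sat on the mat", "the cat is on the mat")
    print(f"ROUGE-L F-measure: {score:.4f}")
```

In this toy pair the LCS is "the cat on the mat" (5 of 6 words on each side), so precision and recall are both 5/6. A figure such as 71.31% would correspond to a score of 0.7131 on this scale, typically averaged over the evaluation set.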