Automatic discovery of salient segments in imperfect speech transcripts

Authors:
Dulce Ponceleon;Savitha Srinivasan
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
Proceedings of the tenth international conference on Information and knowledge management
Year:
2001

Abstract

This paper addresses the automatic detection of salient video segments from the associated speech transcripts, for real-world applications such as corporate training. We present a novel segmentation algorithm that operates on the output of automatic speech recognition (ASR) applied to the audio track of the video. Our feature set consists of word n-grams extracted from the imperfect speech transcripts. We use a two-pass algorithm that combines a boundary-based method with a content-based method. In the first pass, we analyze the temporal distribution and the rate of arrival of features to compute an initial segmentation. In the second pass, we detect changes in content-bearing words by using the content-bearing features as queries in an information retrieval system. The content-based second pass validates the initial segments and merges them as needed. Variations in the structure of the audio/video content and in the accuracy of ASR affect the feasibility of the segmentation task. For realistic data, we observe that we can identify content-rich segments of the audio. In the best scenario, a high-level table of contents is generated; in the worst scenario, a single salient segment is identified. We illustrate the algorithm in detail with examples and validate the results against manual segmentation boundaries.
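
The two-pass idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no thresholds or formulas, so every name and parameter here (first_pass, second_pass, max_gap, min_similarity) is hypothetical. The first pass assumes a simple rule, cutting wherever the gap between consecutive feature arrivals exceeds max_gap seconds. The second pass replaces the information retrieval queries described in the paper with a plain cosine similarity over term counts, and merges adjacent segments whose content is similar.

```python
from collections import Counter
from math import sqrt


def first_pass(features, max_gap):
    """Boundary-based pass (assumed rule): start a new segment whenever the
    time between consecutive feature arrivals exceeds max_gap seconds.
    `features` is a list of (time_in_seconds, ngram) pairs."""
    segments, current, last_time = [], [], None
    for time, term in sorted(features):
        if last_time is not None and time - last_time > max_gap:
            segments.append(current)
            current = []
        current.append((time, term))
        last_time = time
    if current:
        segments.append(current)
    return segments


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def second_pass(segments, min_similarity):
    """Content-based pass (stand-in for the IR queries): merge a segment into
    its predecessor when their term counts are similar enough."""
    merged = []
    for segment in segments:
        if merged:
            previous_terms = Counter(term for _, term in merged[-1])
            current_terms = Counter(term for _, term in segment)
            if cosine(previous_terms, current_terms) >= min_similarity:
                merged[-1] = merged[-1] + segment
                continue
        merged.append(segment)
    return merged


# Made-up example data: (arrival time in seconds, word n-gram).
features = [
    (0.0, "sales plan"), (2.0, "sales plan"), (4.0, "quarterly target"),
    (30.0, "sales plan"), (33.0, "quarterly target"),
    (90.0, "safety training"), (92.0, "safety training"),
]
initial = first_pass(features, max_gap=20.0)
final = second_pass(initial, min_similarity=0.5)
spans = [(segment[0][0], segment[-1][0]) for segment in final]
print(spans)
```

On this made-up data, the first pass produces three segments from the two long pauses, and the second pass merges the first two because they share the same n-grams, leaving one segment for each topic.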