A Subword Normalized Cut Approach to Automatic Story Segmentation of Chinese Broadcast News

  • Authors:
  • Jin Zhang;Lei Xie;Wei Feng;Yanning Zhang

  • Affiliations:
  • Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an, China;Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an, China;Dept. of Computer Science and Engineering, The Chinese University of Hong Kong,;Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an, China

  • Venue:
  • AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper presents a subword normalized cut (N-cut) approach to automatic story segmentation of Chinese broadcast news (BN). We represent a speech recognition transcript using a weighted undirected graph, where the nodes correspond to sentences and the weights of edges describe inter-sentence similarities. Story segmentation is formalized as a graph-partitioning problem under the N-cut criterion, which simultaneously minimizes the similarity across different partitions and maximizes the similarity within each partition. We measure inter-sentence similarities and perform N-cut segmentation on the character/syllable (i.e. subword units) overlapping n-gram sequences. Our method works at the subword levels because subword matching is robust to speech recognition errors and out-of-vocabulary words. Experiments on the TDT2 Mandarin BN corpus show that syllable-bigram-based N-cut achieves the best F1-measure of 0.6911 with relative improvement of 11.52% over previous word-based N-cut that has an F1-measure of 0.6197. N-cut at the subword levels is more effective than the word level for story segmentation of noisy Chinese BN transcripts.