Subword Lexical Chaining for Automatic Story Segmentation in Chinese Broadcast News

Authors:
Lei Xie; Yulian Yang; Jia Zeng
Affiliations:
Audio, Speech and Language Processing Group (ASLP), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Audio, Speech and Language Processing Group (ASLP), School of Computer Science, Northwestern Polytechnical University, Xi'an, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong
Venue:
PCM '08 Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing
Year:
2008

Abstract

We present a subword lexical chaining approach to automatic story segmentation of Chinese broadcast news (BN). Conventional lexical chains link words related by cohesion (e.g. word repetition), and points where many chains start or end are indicative of story boundaries. However, the speech recognition errors that are inevitable in BN transcripts can destroy the cohesiveness of words, resulting in word match failures. We show that Chinese subwords (characters and syllables) are robust for lexical matching in errorful ASR transcripts. This motivates us to discover story boundaries on chains formed by character and syllable n-gram units. Experimental results on the TDT2 Mandarin corpus show that chaining by character unigrams gives the best story segmentation performance, with a relative F-measure improvement of 6.06% over conventional word chaining. Integrating multiple scales (words and subwords) yields further improvement. For example, fusion by voting across different scales achieves an F-measure gain of 9.04% over words.
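
The chaining idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the chaining rules, distance threshold, or boundary scoring, so the repetition-only chains, the `max_gap` parameter, and all function names below are assumptions chosen for illustration. Character n-grams that repeat across nearby sentences are linked into chains, and each gap between sentences is scored by the number of chains ending just before it plus the number starting just after it.

```python
from collections import defaultdict


def char_ngrams(sentence, n=1):
    """Return overlapping character n-grams of a sentence, ignoring whitespace."""
    chars = [c for c in sentence if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_chains(sentences, n=1, max_gap=2):
    """Link repetitions of each n-gram into (start, end) sentence-index chains.

    Assumption: a chain is cut when two consecutive occurrences of the same
    unit are more than `max_gap` sentences apart (hypothetical threshold).
    Units that never repeat form no chain.
    """
    positions = defaultdict(list)
    for idx, sent in enumerate(sentences):
        for unit in set(char_ngrams(sent, n)):
            positions[unit].append(idx)

    chains = []
    for idxs in positions.values():
        start = prev = idxs[0]
        for idx in idxs[1:]:
            if idx - prev > max_gap:
                if prev > start:
                    chains.append((start, prev))
                start = idx
            prev = idx
        if prev > start:
            chains.append((start, prev))
    return chains


def boundary_scores(sentences, n=1, max_gap=2):
    """Score the gap after sentence i: chains ending at i plus chains starting at i + 1."""
    scores = [0] * (len(sentences) - 1)
    for start, end in build_chains(sentences, n, max_gap):
        if end < len(sentences) - 1:
            scores[end] += 1        # chain ends just before this gap
        if start > 0:
            scores[start - 1] += 1  # chain starts just after this gap
    return scores


# Toy transcript: two sentences about the stock market, then two about a typhoon.
toy = ["股市上涨", "股市下跌", "台风登陆", "台风减弱"]
print(boundary_scores(toy))
```

In this toy input the highest score falls on the gap between the second and third sentences, where the two stock-market chains end and the two typhoon chains begin. Passing `n=2` chains character bigrams instead; syllable units would need a character-to-pinyin mapping, which is outside this sketch.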