On the effectiveness of subwords for lexical cohesion based story segmentation of Chinese broadcast news

Authors:
L. Xie; Y.-L. Yang; Z.-Q. Liu
Affiliations:
Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an, China;Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an, China;Media Computing Group, School of Creative Media, City University of Hong Kong, Hong Kong
Venue:
Information Sciences: an International Journal
Year:
2011

Abstract

Story segmentation divides a multimedia stream into homogeneous regions, each addressing a central topic. Lexical cohesion is a reasonable indicator of story boundaries. However, for story segmentation of Chinese broadcast news, directly measuring word-level lexical cohesion is not applicable: the text transcribed from audio is highly unreliable, and the inevitable speech recognition errors can significantly break word cohesion, heavily degrading segmentation performance. To address this problem, we propose using subword-level cohesion for story segmentation of Chinese broadcast news, because Chinese subwords carry substantial semantic content and are robust to speech recognition errors. We provide a comprehensive study of the effectiveness of subword units in story segmentation of Chinese speech recognition transcripts, and analyze the influence of recognition errors on segmentation performance. Specifically, we study subword-based TextTiling and lexical chaining approaches to story segmentation, in which lexical cohesion is measured using either character or syllable n-grams (n = 1, 2, 3, 4). Our extensive experiments demonstrate that subword unigrams and bigrams improve on word-based methods. For instance, on the CCTV corpus, character unigram lexical chaining obtains a relative F1-measure gain of 12% over words on erroneous brief news transcripts (word error rate of 40.9%). In general, we find that subword-based methods often obtain better segmentation than word-based ones for both error-free and erroneous transcripts.
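
As a rough illustration of measuring lexical cohesion over character n-grams, the sketch below compares the n-gram counts on either side of each sentence gap with cosine similarity, in the spirit of TextTiling. It is a simplified sketch under stated assumptions, not the authors' implementation: the function names (`char_ngrams`, `cosine`, `gap_similarities`), the `window` parameter, and the use of plain cosine similarity without depth scoring or smoothing are all choices made here for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Return the overlapping character n-grams of a string (whitespace removed)."""
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two Counter objects; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_similarities(sentences, n=1, window=2):
    """Similarity across each gap between consecutive sentences.

    For the gap before sentence i, up to `window` sentences on each side
    are pooled into one bag of character n-grams (an illustrative choice).
    """
    bags = [Counter(char_ngrams(s, n)) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims
```

Under this sketch, a low similarity at a gap suggests weak cohesion between the neighbouring sentences, which makes that gap a candidate story boundary. Passing `n=2` switches from character unigrams to bigrams; syllable n-grams would need a separate syllable transcription, which is not modelled here.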