This paper applies Chinese subword representations, namely character and syllable n-grams, to TextTiling-based automatic story segmentation of Chinese broadcast news. We show that, in lexical matching on errorful Chinese speech recognition transcripts, Chinese subwords are robust against speech recognition errors, out-of-vocabulary (OOV) words, and the flexibility of Chinese word segmentation. We propose a multi-scale TextTiling approach that combines the specificity of words with the robustness of subwords in the lexical similarity measure used for story boundary identification. Experiments on the TDT2 Mandarin corpus show that subword bigrams achieve the best performance among all scales, with relative F-measure improvements over words of 8.84% (character bigram) and 7.11% (syllable bigram). Multi-scale fusion of subword bigrams with words brings further improvement. In particular, integrating the syllable bigram with the syllable sequence of words achieves an F-measure gain of 2.66% over the syllable bigram alone.
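The idea of lexical matching at several scales can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the function names, the window size, the toy sentences, and the weighted-sum fusion rule are all assumptions made for illustration, and character unigrams stand in for the word scale because no word segmenter or syllable lexicon is assumed to be available.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=2):
    """Overlapping character n-grams of a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_similarities(units, tokenize, window=1):
    """TextTiling-style similarity of the blocks before and after each gap."""
    sims = []
    for g in range(1, len(units)):
        left = Counter(t for u in units[max(0, g - window):g] for t in tokenize(u))
        right = Counter(t for u in units[g:g + window] for t in tokenize(u))
        sims.append(cosine(left, right))
    return sims

def fuse(curves, weights):
    """Hypothetical fusion rule: weighted sum of per-gap similarity curves."""
    return [sum(w * c[i] for w, c in zip(weights, curves))
            for i in range(len(curves[0]))]

# Toy transcript: two sentences on the stock market, then two on a typhoon.
sentences = ["股市今天上涨", "股市指数创新高", "台风登陆沿海", "台风造成停电"]

bigram_sims = gap_similarities(sentences, char_ngrams)
unigram_sims = gap_similarities(sentences, lambda s: char_ngrams(s, n=1))
fused = fuse([bigram_sims, unigram_sims], weights=[0.5, 0.5])

# The gap with the lowest fused similarity is the boundary candidate.
boundary_gap = min(range(len(fused)), key=fused.__getitem__)
```

In this toy example the lowest similarity falls at the gap between the second and third sentences, where the topic changes. A full TextTiling system would additionally smooth the similarity curve and select boundaries from depth scores rather than taking a single minimum.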