SLSP'13 Proceedings of the First international conference on Statistical Language and Speech Processing
Transcript-based topic segmentation of TV programs faces several difficulties arising from transcription errors, from the presence of potentially short segments, and from the limited number of word repetitions available to enforce lexical cohesion, i.e., the lexical relations that exist within a text and give it a certain unity. To overcome these problems, we extend a probabilistic measure of lexical cohesion based on generalized probabilities computed with a unigram language model. On the one hand, confidence measures and semantic relations are considered as additional sources of information; on the other hand, language model interpolation techniques are investigated for better language model estimation. Experimental topic segmentation results are reported on two corpora with distinct characteristics, composed respectively of broadcast news and of reports on current affairs. Significant improvements are obtained on both corpora, demonstrating the effectiveness of the extended lexical cohesion measure for spoken TV content, as well as its genericity across different programs.
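To make the idea concrete, the following is a minimal sketch of how lexical cohesion can be scored with generalized probabilities under an interpolated unigram language model. The function name, the Laplace smoothing, the interpolation weight, and the vocabulary size are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import math
from collections import Counter

def interpolated_unigram_logprob(segment, background, lam=0.7, vocab_size=50000):
    """Score a candidate segment by the log-probability of its words under a
    segment-internal unigram model interpolated with a background model.

    Hypothetical sketch: `lam` and `vocab_size` are illustrative values.
    `background` maps words to background unigram probabilities; unseen words
    fall back to a uniform 1/vocab_size estimate.
    """
    logprob = 0.0
    for i, word in enumerate(segment):
        # Generalized (incremental) probability: estimate the word's
        # probability from the words seen so far in the segment,
        # Laplace-smoothed over the vocabulary. Repeated words thus
        # receive higher probability, rewarding lexical cohesion.
        seen = Counter(segment[:i])
        p_seg = (seen[word] + 1) / (i + vocab_size)
        # Linear interpolation with the background language model.
        p_bg = background.get(word, 1.0 / vocab_size)
        logprob += math.log(lam * p_seg + (1 - lam) * p_bg)
    return logprob
```

Under this scoring, a segment that repeats its vocabulary (high lexical cohesion) obtains a higher log-probability than an equally long segment of unrelated words, which is the signal a segmentation algorithm can maximize when placing topic boundaries.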