ClassStruggle: a clustering based text segmentation

Authors:
Sylvain Lamprier; Tassadit Amghar; Bernard Levrat; Frederic Saubion
Affiliations:
Université Angers, Angers, France; Université Angers, Angers, France; Université Angers, Angers, France; Université Angers, Angers, France
Venue:
Proceedings of the 2007 ACM symposium on Applied computing
Year:
2007

Abstract

This paper describes ClassStruggle, an algorithm for linear text segmentation on general corpora. It relies on an initial clustering of the sentences of the text. This preliminary partitioning provides a global view of the relations between sentences in the text, considering similarities within a group rather than individually. ClassStruggle is based on the distribution of the occurrences of the members of each class. During the process, the clusters evolve by taking into account notions of proximity and layout in the text, with the aim of creating groups that contain only sentences related to the same topic development. Finally, boundaries are created between sentences belonging to two different classes. First experimental results are promising: ClassStruggle appears to be very competitive with existing methods.
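
The general pipeline outlined in the abstract (cluster the sentences, let the clusters evolve according to position in the text, then cut where the class changes) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the similarity measure, the clustering method, or the rule by which classes evolve, so the bag-of-words cosine similarity, the greedy clustering with its `threshold` parameter, and the neighbour-based relabelling in `smooth_by_proximity` are all assumptions chosen only to keep the example short.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts (assumed sentence representation)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def initial_clusters(vectors, threshold=0.2):
    """Greedy clustering (assumption): each sentence joins the class of the
    most similar earlier sentence if the similarity reaches the threshold,
    otherwise it opens a new class."""
    labels = []
    for i, vec in enumerate(vectors):
        best_j, best_sim = None, threshold
        for j in range(i):
            sim = cosine(vec, vectors[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        labels.append(labels[best_j] if best_j is not None else len(set(labels)))
    return labels


def smooth_by_proximity(labels):
    """Stand-in for the cluster evolution step (assumption): a sentence whose
    two neighbours share a class different from its own adopts that class."""
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            out[i] = labels[i - 1]
    return out


def boundaries(labels):
    """Indices i such that a boundary lies between sentence i-1 and sentence i."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


def segment(sentences, threshold=0.2):
    """Full sketch: represent, cluster, smooth, then place boundaries."""
    vectors = [bag_of_words(s) for s in sentences]
    labels = smooth_by_proximity(initial_clusters(vectors, threshold))
    return boundaries(labels)
```

Calling `segment` on a list of sentences returns the positions at which a new segment starts. The smoothing step here makes a single pass over fixed neighbours; the paper's actual procedure is based on the distribution of class members across the text and is likely more elaborate.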