Word distribution based methods for minimizing segment overlaps

Authors:
Joe Vasak; Fei Song
Affiliations:
Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada; Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada
Venue:
TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
Year:
2007

Abstract

Dividing coherent text into a sequence of coherent segments is a challenging task, since different topics and subtopics are often related to one or more common themes. Based on lexical cohesion, we can keep track of words and their repetitions and break the text into segments at points where the lexical chains are weak. However, some words are more or less evenly distributed across a document (called document-dependent or distributional stopwords), which makes it difficult to separate one segment from another. To minimize the overlaps between segments, we propose two new measures, both based on word distribution, for removing distributional stopwords. Our experimental results show that the new measures are both efficient to compute and effective at improving segmentation performance on expository text and transcribed lecture text.
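The abstract does not define the two proposed measures, so the following is only an illustrative sketch of the general idea of a distributional stopword: a word whose occurrences are spread nearly evenly over the document. It assumes a stand-in measure (normalized entropy of a word's counts over equal-size blocks), which is not taken from the paper; the function names, the block count, and the threshold are all hypothetical.

```python
import math
from collections import Counter


def evenness(tokens, num_blocks=4):
    """Score each word by how evenly it is spread over the document.

    Assumption (not from the paper): evenness is the entropy of the word's
    counts over equal-size blocks, divided by log(num_blocks), so scores
    fall between 0 (one block only) and 1 (perfectly even).
    """
    if not tokens or num_blocks < 2:
        return {}
    block_size = math.ceil(len(tokens) / num_blocks)
    per_block = [
        Counter(tokens[i:i + block_size])
        for i in range(0, len(tokens), block_size)
    ]
    scores = {}
    for word in set(tokens):
        counts = [block[word] for block in per_block]
        total = sum(counts)
        entropy = -sum(
            (n / total) * math.log(n / total) for n in counts if n > 0
        )
        scores[word] = entropy / math.log(num_blocks)
    return scores


def distributional_stopwords(tokens, num_blocks=4, threshold=0.9, min_count=2):
    """Return words that are frequent enough and evenly spread (hypothetical cutoffs)."""
    counts = Counter(tokens)
    scores = evenness(tokens, num_blocks)
    return {
        word
        for word, score in scores.items()
        if score >= threshold and counts[word] >= min_count
    }


tokens = "the cat sat the dog ran the sun set the sky fell".split()
stopwords = distributional_stopwords(tokens)
filtered = [t for t in tokens if t not in stopwords]
```

In this toy input, the word that appears once in every block is flagged and removed, while words confined to a single block are kept, since those are the ones that help tell segments apart. A real system would apply such filtering before computing lexical cohesion between neighbouring blocks.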