Coverage-based methods for distributional stopword selection in text segmentation

Authors:
Joe Vasak;Fei Song
Affiliations:
Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada;Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada
Venue:
TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
Year:
2010

Citing 9
Cited 0

Statistical Models for Text Segmentation

Machine Learning - Special issue on natural language learning
Domain-independent text segmentation using anisotropic diffusion and dynamic programming

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Topic segmentation: algorithms and applications

Topic segmentation: algorithms and applications
Advances in domain independent linear text segmentation

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
A maximum entropy approach to identifying sentence boundaries

ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Multi-paragraph segmentation of expository text

ACL '94 Proceedings of the 32nd annual meeting on Association for Computational Linguistics
A statistical model for domain-independent text segmentation

ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Minimum cut model for spoken lecture segmentation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Word distribution based methods for minimizing segment overlaps

TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue

Quantified Score

Hi-index	0.00

Visualization

Abstract

Unlike the common stopwords in information retrieval, distributional stopwords are document-specific and refer to the words that are more or less evenly distributed across a document. Isolating distributional stopwords has been shown to be useful for text segmentation, since it helps improve the representation of a segment by reducing the overlapped words between neighboring segments. In this paper, we propose three new measures for distributional stopword selection and expand the notion of distributional stopwords from the document level to a topic level. Two of our new measures are based on the distributional coverage of a word and the other one is extended from an existing measure called distribution difference by relying on the density of words in a way similar to another measure called distribution significance. Our experiments show that these new measures are not only efficient to compute, but also more accurate than or comparable to the existing measures for distributional stopword selection and that distributional stopword selection at a topic level is more accurate than document level selection for subtopic segmentation.