Coverage-based methods for distributional stopword selection in text segmentation

  • Authors:
  • Joe Vasak;Fei Song

  • Affiliations:
  • Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada;Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada

  • Venue:
  • TSD'10 Proceedings of the 13th international conference on Text, speech and dialogue
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Unlike the common stopwords in information retrieval, distributional stopwords are document-specific and refer to the words that are more or less evenly distributed across a document. Isolating distributional stopwords has been shown to be useful for text segmentation, since it helps improve the representation of a segment by reducing the overlapped words between neighboring segments. In this paper, we propose three new measures for distributional stopword selection and expand the notion of distributional stopwords from the document level to a topic level. Two of our new measures are based on the distributional coverage of a word and the other one is extended from an existing measure called distribution difference by relying on the density of words in a way similar to another measure called distribution significance. Our experiments show that these new measures are not only efficient to compute, but also more accurate than or comparable to the existing measures for distributional stopword selection and that distributional stopword selection at a topic level is more accurate than document level selection for subtopic segmentation.