Word distribution based methods for minimizing segment overlaps

  • Authors:
  • Joe Vasak;Fei Song

  • Affiliations:
  • Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada;Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada

  • Venue:
  • TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Dividing coherent text into a sequence of coherent segments is a challenging task since different topics/subtopics are often related to a common theme(s). Based on lexical cohesion, we can keep track of words and their repetitions and break text into segments at points where the lexical chains are weak. However, there exist words that are more or less evenly distributed across a document (called document-dependent or distributional stopwords), making it difficult to separate one segment from another. To minimize the overlaps between segments, we propose two new measures for removing distributional stopwords based on word distribution. Our experimental results show that the new measures are both efficient to compute and effective for improving the segmentation performance of expository text and transcribed lecture text.