TOWARD A MORE GLOBAL AND COHERENT SEGMENTATION OF TEXTS

  • Authors:
  • Sylvain Lamprier;Tassadit Amghar;Bernard Levrat;Frédéric Saubion

  • Affiliations:
  • Laboratoire d'Etude et de Recherche en Informatique d'Angers, University of Angers, Angers, France;Laboratoire d'Etude et de Recherche en Informatique d'Angers, University of Angers, Angers, France;Laboratoire d'Etude et de Recherche en Informatique d'Angers, University of Angers, Angers, France;Laboratoire d'Etude et de Recherche en Informatique d'Angers, University of Angers, Angers, France

  • Venue:
  • Applied Artificial Intelligence
  • Year:
  • 2008

Abstract

The automatic text segmentation task consists of identifying the most important thematic breaks in a document in order to cut it into homogeneous passages. Text segmentation has motivated a large amount of research. We focus here on statistical approaches that rely on an analysis of the distribution of words in the text. Usually, texts are segmented sequentially on the basis of very local clues. However, such an approach prevents the text from being considered globally, particularly with respect to the degree of granularity at which it expresses the different topics it addresses. We thus propose two new segmentation algorithms, ClassStruggle and SegGen, which use criteria that give a global view of the text. ClassStruggle is based on an initial clustering of the sentences of the text, which allows similarities to be considered within a group of sentences rather than between individual sentences. It relies on the distribution of the occurrences of the members of each class to segment the text. SegGen evaluates potential segmentations of the whole text by means of a genetic algorithm. It seeks a segmentation that optimizes two criteria: maximizing the internal cohesion of the segments and minimizing the similarity between adjacent ones. According to experimental results, both approaches appear to be very competitive with existing methods.
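The two criteria named for SegGen can be illustrated with a minimal sketch. It is not the authors' formulation: the abstract gives no formulas, so the bag-of-words vectors, the cosine measure, the averaging scheme, and every function name below are assumptions chosen for illustration. The genetic search itself is not shown, only how one candidate segmentation might be scored on both objectives.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_segments(sentences, boundaries):
    """Cut the sentence list after each sentence index listed in `boundaries`."""
    cuts = [0] + [b + 1 for b in sorted(boundaries)] + [len(sentences)]
    return [sentences[s:e] for s, e in zip(cuts, cuts[1:])]


def internal_cohesion(segments):
    """Mean pairwise sentence similarity per segment, averaged over segments."""
    scores = []
    for seg in segments:
        pairs = [(i, j) for i in range(len(seg)) for j in range(i + 1, len(seg))]
        if pairs:
            scores.append(sum(cosine(seg[i], seg[j]) for i, j in pairs) / len(pairs))
        else:
            # Assumption: a one-sentence segment has no pairs and is scored 0.0.
            scores.append(0.0)
    return sum(scores) / len(scores)


def adjacent_similarity(segments):
    """Mean cosine similarity between the summed vectors of neighbouring segments."""
    totals = [sum(seg, Counter()) for seg in segments]
    if len(totals) < 2:
        return 0.0
    sims = [cosine(a, b) for a, b in zip(totals, totals[1:])]
    return sum(sims) / len(sims)


# Toy document (hypothetical data): two sentences about cats, two about markets.
text = [
    "cats purr and cats sleep",
    "cats chase mice",
    "stocks fell sharply",
    "stocks and bonds rallied",
]
sentences = [Counter(s.split()) for s in text]

good = split_segments(sentences, [1])  # break between the two topics
bad = split_segments(sentences, [0])   # break after the first sentence

for name, seg in (("good", good), ("bad", bad)):
    print(name, internal_cohesion(seg), adjacent_similarity(seg))
```

On this toy input, the break placed between the two topics scores higher on internal cohesion and lower on adjacent similarity than the misplaced break. A search procedure such as a genetic algorithm would compare many candidate boundary sets on these two objectives rather than deciding each boundary from local clues alone.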